doc/textutils.texi

   1 \input texinfo
   2 @c %**start of header
   3 @setfilename textutils.info
   4 @settitle @sc{gnu} text utilities
   5 @c %**end of header
   6
   7 @include version.texi
   8 @include constants.texi
   9
  10 @c Define new indices.
  11 @defcodeindex op
  12
  13 @c Put everything in one index (arbitrarily chosen to be the concept index).
  14 @syncodeindex fn cp
  15 @syncodeindex ky cp
  16 @syncodeindex op cp
  17 @syncodeindex pg cp
  18 @syncodeindex vr cp
  19
  20 @ifinfo
  21 @format
  22 START-INFO-DIR-ENTRY
  23 * Text utilities: (textutils).                  GNU text utilities.
  24 * cat: (textutils)cat invocation.               Concatenate and write files.
  25 * cksum: (textutils)cksum invocation.           Print @sc{posix} CRC checksum.
  26 * comm: (textutils)comm invocation.             Compare sorted files by line.
  27 * csplit: (textutils)csplit invocation.         Split by context.
  28 * cut: (textutils)cut invocation.               Print selected parts of lines.
  29 * expand: (textutils)expand invocation.         Convert tabs to spaces.
  30 * fmt: (textutils)fmt invocation.               Reformat paragraph text.
  31 * fold: (textutils)fold invocation.             Wrap long input lines.
  32 * head: (textutils)head invocation.             Output the first part of files.
  33 * join: (textutils)join invocation.             Join lines on a common field.
  34 * md5sum: (textutils)md5sum invocation.         Print or check message-digests.
  35 * nl: (textutils)nl invocation.                 Number lines and write files.
  36 * od: (textutils)od invocation.                 Dump files in octal, etc.
  37 * paste: (textutils)paste invocation.           Merge lines of files.
  38 * pr: (textutils)pr invocation.                 Paginate or columnate files.
  39 * ptx: (textutils)ptx invocation.               Produce permuted indexes.
  40 * sort: (textutils)sort invocation.             Sort text files.
  41 * split: (textutils)split invocation.           Split into fixed-size pieces.
  42 * sum: (textutils)sum invocation.               Print traditional checksum.
  43 * tac: (textutils)tac invocation.               Reverse files.
  44 * tail: (textutils)tail invocation.             Output the last part of files.
  45 * tsort: (textutils)tsort invocation.           Topological sort.
  46 * tr: (textutils)tr invocation.                 Translate characters.
  47 * unexpand: (textutils)unexpand invocation.     Convert spaces to tabs.
  48 * uniq: (textutils)uniq invocation.             Uniquify files.
  49 * wc: (textutils)wc invocation.                 Byte, word, and line counts.
  50 END-INFO-DIR-ENTRY
  51 @end format
  52 @end ifinfo
  53
  54 @ifinfo
  55 This file documents the GNU text utilities.
  56
  57 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
  58
  59 Permission is granted to copy, distribute and/or modify this document
  60 under the terms of the GNU Free Documentation License, Version 1.1
  61 or any later version published by the Free Software Foundation;
  62 with no Invariant Sections, with no
  63 Front-Cover Texts, and with no Back-Cover Texts.
  64 A copy of the license is included in the section entitled ``GNU
  65 Free Documentation License''.
  66
  67 @end ifinfo
  68
  69 @titlepage
  70 @title @sc{gnu} @code{textutils}
  71 @subtitle A set of text utilities
  72 @subtitle for version @value{VERSION}, @value{UPDATED}
  73 @author David MacKenzie et al.
  74
  75 @page
  76 @vskip 0pt plus 1filll
  77 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
  78
  79 Permission is granted to copy, distribute and/or modify this document
  80 under the terms of the GNU Free Documentation License, Version 1.1
  81 or any later version published by the Free Software Foundation;
  82 with no Invariant Sections, with no
  83 Front-Cover Texts, and with no Back-Cover Texts.
  84 A copy of the license is included in the section entitled ``GNU
  85 Free Documentation License''.
  86 @end titlepage
  87
  88
  89 @c If your makeinfo doesn't grok this @ifnottex directive, then either
  90 @c get a newer version of makeinfo or do s/ifnottex/ifinfo/ here and on
  91 @c the matching @end directive below.
  92 @ifnottex
  93 @node Top
  94 @top GNU text utilities
  95
  96 @cindex text utilities
  97 @cindex utilities for text handling
  98
  99 This manual documents version @value{VERSION} of the @sc{gnu} text utilities.
 100
 101 @menu
 102 * Introduction::                       Caveats, overview, and authors.
 103 * Common options::                     Common options.
 104 * Output of entire files::             cat tac nl od
 105 * Formatting file contents::           fmt pr fold
 106 * Output of parts of files::           head tail split csplit
 107 * Summarizing files::                  wc sum cksum md5sum
 108 * Operating on sorted files::          sort uniq comm ptx tsort
 109 * Operating on fields within a line::  cut paste join
 110 * Operating on characters::            tr expand unexpand
 111 * Opening the software toolbox::       The software tools philosophy.
 112 * Index::                              General index.
 113
 114 @detailmenu
 115  --- The Detailed Node Listing ---
 116
 117 Output of entire files
 118
 119 * cat invocation::              Concatenate and write files.
 120 * tac invocation::              Concatenate and write files in reverse.
 121 * nl invocation::               Number lines and write files.
 122 * od invocation::               Write files in octal or other formats.
 123
 124 Formatting file contents
 125
 126 * fmt invocation::              Reformat paragraph text.
 127 * pr invocation::               Paginate or columnate files for printing.
 128 * fold invocation::             Wrap input lines to fit in specified width.
 129
 130 Output of parts of files
 131
 132 * head invocation::             Output the first part of files.
 133 * tail invocation::             Output the last part of files.
 134 * split invocation::            Split a file into fixed-size pieces.
 135 * csplit invocation::           Split a file into context-determined pieces.
 136
 137 Summarizing files
 138
 139 * wc invocation::               Print byte, word, and line counts.
 140 * sum invocation::              Print checksum and block counts.
 141 * cksum invocation::            Print CRC checksum and byte counts.
 142 * md5sum invocation::           Print or check message-digests.
 143
 144 Operating on sorted files
 145
 146 * sort invocation::             Sort text files.
 147 * uniq invocation::             Uniquify files.
 148 * comm invocation::             Compare two sorted files line by line.
 149 * ptx invocation::              Produce a permuted index of file contents.
 150 * tsort invocation::            Topological sort.
 151
 152 @code{ptx}: Produce permuted indexes
 153
 154 * General options in ptx::      Options which affect general program behavior.
 155 * Charset selection in ptx::    Underlying character set considerations.
 156 * Input processing in ptx::     Input fields, contexts, and keyword selection.
 157 * Output formatting in ptx::    Types of output format, and sizing the fields.
 158 * Compatibility in ptx::        The GNU extensions to @code{ptx}
 159
 160 Operating on fields within a line
 161
 162 * cut invocation::              Print selected parts of lines.
 163 * paste invocation::            Merge lines of files.
 164 * join invocation::             Join lines on a common field.
 165
 166 Operating on characters
 167
 168 * tr invocation::               Translate, squeeze, and/or delete characters.
 169 * expand invocation::           Convert tabs to spaces.
 170 * unexpand invocation::         Convert spaces to tabs.
 171
 172 @code{tr}: Translate, squeeze, and/or delete characters
 173
 174 * Character sets::              Specifying sets of characters.
  175 * Translating::                 Changing one character to another.
 176 * Squeezing::                   Squeezing repeats and deleting.
 177 * Warnings in tr::              Warning messages.
 178
 179 Opening the software toolbox
 180
 181 * Toolbox introduction::        Toolbox introduction
 182 * I/O redirection::             I/O redirection
 183 * The who command::             The @code{who} command
 184 * The cut command::             The @code{cut} command
 185 * The sort command::            The @code{sort} command
 186 * The uniq command::            The @code{uniq} command
 187 * Putting the tools together::  Putting the tools together
 188
 189 @end detailmenu
 190 @end menu
 191
 192 @end ifnottex
 193
 194
 195 @node Introduction
 196 @chapter Introduction
 197
 198 @cindex introduction
 199
  200 This manual is incomplete: no attempt is made to explain basic concepts
  201 in a way suitable for novices.  If you are interested, please get
  202 involved in improving this manual.  The entire @sc{gnu} community will
  203 benefit.
 204
 205 @cindex POSIX.2
 206 The @sc{gnu} text utilities are mostly compatible with the @sc{posix.2}
 207 standard.
 208
 209 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
 210 @c sh-utils.texi too -- so be sure to keep them consistent.
 211 @cindex bugs, reporting
 212 Please report bugs to @email{bug-textutils@@gnu.org}.  Remember
 213 to include the version number, machine architecture, input files, and
 214 any other information needed to reproduce the bug: your input, what you
 215 expected, what you got, and why it is wrong.  Diffs are welcome, but
 216 please include a description of the problem as well, since this is
 217 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
 218
 219 This manual was originally derived from the Unix man pages in the
 220 distribution, which were written by David MacKenzie and updated by Jim
 221 Meyering.  What you are reading now is the authoritative documentation
  222 for these utilities; the man pages are no longer being maintained.
 223 The original @code{fmt} man page was written by Ross Paterson.
 224 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
 225 Karl Berry did the indexing, some reorganization, and editing of the results.
 226 Richard Stallman contributed his usual invaluable insights to the
 227 overall process.
 228
 229
 230 @node Common options
 231 @chapter Common options
 232
 233 @cindex common options
 234
  235 Certain options are available in all of these programs.  Rather than
  236 repeating identical descriptions for each program, this manual
  237 describes them here.  (In fact, every @sc{gnu} program accepts, or should
  238 accept, these options.)
 239
 240 Some of these programs recognize the @samp{--help} and @samp{--version}
 241 options only when one of them is the sole command line argument.
 242
 243 @table @samp
 244
 245 @item --help
 246 @opindex --help
 247 @cindex help, online
 248 Print a usage message listing all available options, then exit successfully.
 249
 250 @item --version
 251 @opindex --version
 252 @cindex version number, finding
 253 Print the version number, then exit successfully.
 254
 255 @end table
 256
 257
 258 @node Output of entire files
 259 @chapter Output of entire files
 260
 261 @cindex output of entire files
 262 @cindex entire files, output of
 263
 264 These commands read and write entire files, possibly transforming them
 265 in some way.
 266
 267 @menu
 268 * cat invocation::              Concatenate and write files.
 269 * tac invocation::              Concatenate and write files in reverse.
 270 * nl invocation::               Number lines and write files.
 271 * od invocation::               Write files in octal or other formats.
 272 @end menu
 273
 274 @node cat invocation
 275 @section @code{cat}: Concatenate and write files
 276
 277 @pindex cat
 278 @cindex concatenate and write files
 279 @cindex copying files
 280
 281 @code{cat} copies each @var{file} (@samp{-} means standard input), or
 282 standard input if none are given, to standard output.  Synopsis:
 283
 284 @example
  285 cat [@var{option}]@dots{} [@var{file}]@dots{}
 286 @end example
 287
 288 The program accepts the following options.  Also see @ref{Common options}.
 289
 290 @table @samp
 291
 292 @item -A
 293 @itemx --show-all
 294 @opindex -A
 295 @opindex --show-all
 296 Equivalent to @samp{-vET}.
 297
 298 @item -B
 299 @itemx --binary
 300 @opindex -B
 301 @opindex --binary
 302 @cindex binary and text I/O in cat
 303 On MS-DOS and MS-Windows only, read and write the files in binary mode.
 304 By default, @code{cat} on MS-DOS/MS-Windows uses binary mode only when
 305 standard output is redirected to a file or a pipe; this option overrides
 306 that.  Binary file I/O is used so that the files retain their format
 307 (Unix text as opposed to DOS text and binary), because @code{cat} is
 308 frequently used as a file-copying program.  Some options (see below)
 309 cause @code{cat} to read and write files in text mode because in those
 310 cases the original file contents aren't important (e.g., when lines are
 311 numbered by @code{cat}, or when line endings should be marked).  This is
 312 so these options work as DOS/Windows users would expect; for example,
 313 DOS-style text files have their lines end with the CR-LF pair of
 314 characters, which won't be processed as an empty line by @samp{-b} unless
 315 the file is read in text mode.
 316
 317 @item -b
 318 @itemx --number-nonblank
 319 @opindex -b
 320 @opindex --number-nonblank
 321 Number all nonblank output lines, starting with 1.  On MS-DOS and
 322 MS-Windows, this option causes @code{cat} to read and write files in
 323 text mode.
 324
 325 @item -e
 326 @opindex -e
 327 Equivalent to @samp{-vE}.
 328
 329 @item -E
 330 @itemx --show-ends
 331 @opindex -E
 332 @opindex --show-ends
 333 Display a @samp{$} after the end of each line.  On MS-DOS and
 334 MS-Windows, this option causes @code{cat} to read and write files in
 335 text mode.
 336
 337 @item -n
 338 @itemx --number
 339 @opindex -n
 340 @opindex --number
 341 Number all output lines, starting with 1.  On MS-DOS and MS-Windows,
 342 this option causes @code{cat} to read and write files in text mode.
 343
 344 @item -s
 345 @itemx --squeeze-blank
 346 @opindex -s
 347 @opindex --squeeze-blank
 348 @cindex squeezing blank lines
 349 Replace multiple adjacent blank lines with a single blank line.  On
 350 MS-DOS and MS-Windows, this option causes @code{cat} to read and write
 351 files in text mode.
 352
 353 @item -t
 354 @opindex -t
 355 Equivalent to @samp{-vT}.
 356
 357 @item -T
 358 @itemx --show-tabs
 359 @opindex -T
 360 @opindex --show-tabs
 361 Display TAB characters as @samp{^I}.
 362
 363 @item -u
 364 @opindex -u
 365 Ignored; for Unix compatibility.
 366
 367 @item -v
 368 @itemx --show-nonprinting
 369 @opindex -v
 370 @opindex --show-nonprinting
 371 Display control characters except for LFD and TAB using
 372 @samp{^} notation and precede characters that have the high bit set with
 373 @samp{M-}.  On MS-DOS and MS-Windows, this option causes @code{cat} to
 374 read files and standard input in DOS binary mode, so the CR
 375 characters at the end of each line are also visible.
 376
 377 @end table
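
As a brief illustration of the options above, the following shell sketch
pipes small samples through @code{cat}.  It assumes a @sc{gnu} @code{cat}
and a POSIX shell with @code{printf}; the variable names and the sample
text are invented for this example.

```shell
# Collapse runs of blank lines into a single blank line (-s).
squeezed=$(printf 'a\n\n\n\nb\n' | cat -s)

# Number every output line (-n) or only the nonblank ones (-b).
numbered=$(printf 'a\n\nb\n' | cat -n)
nonblank=$(printf 'a\n\nb\n' | cat -b)

# Make TABs and line ends visible (-A is the same as -vET).
visible=$(printf 'a\tb\n\nc\n' | cat -A)

printf '%s\n' "$squeezed" "$numbered" "$nonblank" "$visible"
```

With @samp{-A}, each TAB appears as @samp{^I} and each line end is
marked with @samp{$}, which makes stray whitespace easy to spot.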
 378
 379
 380 @node tac invocation
 381 @section @code{tac}: Concatenate and write files in reverse
 382
 383 @pindex tac
 384 @cindex reversing files
 385
 386 @code{tac} copies each @var{file} (@samp{-} means standard input), or
 387 standard input if none are given, to standard output, reversing the
 388 records (lines by default) in each separately.  Synopsis:
 389
 390 @example
 391 tac [@var{option}]@dots{} [@var{file}]@dots{}
 392 @end example
 393
 394 @dfn{Records} are separated by instances of a string (newline by
 395 default).  By default, this separator string is attached to the end of
 396 the record that it follows in the file.
 397
 398 The program accepts the following options.  Also see @ref{Common options}.
 399
 400 @table @samp
 401
 402 @item -b
 403 @itemx --before
 404 @opindex -b
 405 @opindex --before
 406 The separator is attached to the beginning of the record that it
 407 precedes in the file.
 408
 409 @item -r
 410 @itemx --regex
 411 @opindex -r
 412 @opindex --regex
 413 Treat the separator string as a regular expression.  Users of @code{tac}
 414 on MS-DOS/MS-Windows should note that, since @code{tac} reads files in
 415 binary mode, each line of a text file might end with a CR/LF pair
 416 instead of the Unix-style LF.
 417
 418 @item -s @var{separator}
 419 @itemx --separator=@var{separator}
 420 @opindex -s
 421 @opindex --separator
 422 Use @var{separator} as the record separator, instead of newline.
 423
 424 @end table
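
The effect of the separator options can be sketched with a few short
pipelines.  This example assumes a @sc{gnu} @code{tac} and a POSIX shell;
the variable names and the sample records are invented for illustration.

```shell
# Default: reverse the order of newline-terminated lines.
lines=$(printf 'one\ntwo\nthree\n' | tac)

# -s: use "," as the separator; it stays attached to the end of
# the record that it follows.
after=$(printf 'a,b,c,' | tac -s ,)

# -b: attach the separator to the beginning of the record that it
# precedes instead.
before=$(printf ',a,b,c' | tac -b -s ,)

printf '%s\n' "$lines" "$after" "$before"
```

The second pipeline yields @samp{c,b,a,} and the third yields
@samp{,c,b,a}: in both cases whole records, separator included, are
reversed.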
 425
 426
 427 @node nl invocation
 428 @section @code{nl}: Number lines and write files
 429
 430 @pindex nl
 431 @cindex numbering lines
 432 @cindex line numbering
 433
 434 @code{nl} writes each @var{file} (@samp{-} means standard input), or
 435 standard input if none are given, to standard output, with line numbers
 436 added to some or all of the lines.  Synopsis:
 437
 438 @example
 439 nl [@var{option}]@dots{} [@var{file}]@dots{}
 440 @end example
 441
 442 @cindex logical pages, numbering on
 443 @code{nl} decomposes its input into (logical) pages; by default, the
 444 line number is reset to 1 at the top of each logical page.  @code{nl}
 445 treats all of the input files as a single document; it does not reset
 446 line numbers or logical pages between files.
 447
 448 @cindex headers, numbering
 449 @cindex body, numbering
 450 @cindex footers, numbering
 451 A logical page consists of three sections: header, body, and footer.
 452 Any of the sections can be empty.  Each can be numbered in a different
 453 style from the others.
 454
 455 The beginnings of the sections of logical pages are indicated in the
 456 input file by a line containing exactly one of these delimiter strings:
 457
 458 @table @samp
 459 @item \:\:\:
 460 start of header;
 461 @item \:\:
 462 start of body;
 463 @item \:
 464 start of footer.
 465 @end table
 466
 467 The two characters from which these strings are made can be changed from
 468 @samp{\} and @samp{:} via options (see below), but the pattern and
 469 length of each string cannot be changed.
 470
 471 A section delimiter is replaced by an empty line on output.  Any text
 472 that comes before the first section delimiter string in the input file
 473 is considered to be part of a body section, so @code{nl} treats a
 474 file that contains no section delimiters as a single body section.
 475
 476 The program accepts the following options.  Also see @ref{Common options}.
 477
 478 @table @samp
 479
 480 @item -b @var{style}
 481 @itemx --body-numbering=@var{style}
 482 @opindex -b
 483 @opindex --body-numbering
 484 Select the numbering style for lines in the body section of each
 485 logical page.  When a line is not numbered, the current line number
 486 is not incremented, but the line number separator character is still
 487 prepended to the line.  The styles are:
 488
 489 @table @samp
 490 @item a
 491 number all lines,
 492 @item t
 493 number only nonempty lines (default for body),
 494 @item n
 495 do not number lines (default for header and footer),
 496 @item p@var{regexp}
 497 number only lines that contain a match for @var{regexp}.
 498 @end table
 499
 500 @item -d @var{cd}
 501 @itemx --section-delimiter=@var{cd}
 502 @opindex -d
 503 @opindex --section-delimiter
 504 @cindex section delimiters of pages
 505 Set the section delimiter characters to @var{cd}; default is
 506 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
 507 (Remember to protect @samp{\} or other metacharacters from shell
 508 expansion with quotes or extra backslashes.)
 509
 510 @item -f @var{style}
 511 @itemx --footer-numbering=@var{style}
 512 @opindex -f
 513 @opindex --footer-numbering
 514 Analogous to @samp{--body-numbering}.
 515
 516 @item -h @var{style}
 517 @itemx --header-numbering=@var{style}
 518 @opindex -h
 519 @opindex --header-numbering
 520 Analogous to @samp{--body-numbering}.
 521
 522 @item -i @var{number}
 523 @itemx --page-increment=@var{number}
 524 @opindex -i
 525 @opindex --page-increment
 526 Increment line numbers by @var{number} (default 1).
 527
 528 @item -l @var{number}
 529 @itemx --join-blank-lines=@var{number}
 530 @opindex -l
 531 @opindex --join-blank-lines
 532 @cindex empty lines, numbering
 533 @cindex blank lines, numbering
 534 Consider @var{number} (default 1) consecutive empty lines to be one
 535 logical line for numbering, and only number the last one.  Where fewer
 536 than @var{number} consecutive empty lines occur, do not number them.
 537 An empty line is one that contains no characters, not even spaces
 538 or tabs.
 539
 540 @item -n @var{format}
 541 @itemx --number-format=@var{format}
 542 @opindex -n
 543 @opindex --number-format
 544 Select the line numbering format (default is @code{rn}):
 545
 546 @table @samp
 547 @item ln
 548 @opindex ln @r{format for @code{nl}}
 549 left justified, no leading zeros;
 550 @item rn
 551 @opindex rn @r{format for @code{nl}}
 552 right justified, no leading zeros;
 553 @item rz
 554 @opindex rz @r{format for @code{nl}}
 555 right justified, leading zeros.
 556 @end table
 557
 558 @item -p
 559 @itemx --no-renumber
 560 @opindex -p
 561 @opindex --no-renumber
 562 Do not reset the line number at the start of a logical page.
 563
 564 @item -s @var{string}
 565 @itemx --number-separator=@var{string}
 566 @opindex -s
 567 @opindex --number-separator
 568 Separate the line number from the text line in the output with
 569 @var{string} (default is the TAB character).
 570
 571 @item -v @var{number}
 572 @itemx --starting-line-number=@var{number}
 573 @opindex -v
 574 @opindex --starting-line-number
 575 Set the initial line number on each logical page to @var{number} (default 1).
 576
 577 @item -w @var{number}
 578 @itemx --number-width=@var{number}
 579 @opindex -w
 580 @opindex --number-width
 581 Use @var{number} characters for line numbers (default 6).
 582
 583 @end table
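
The following shell sketch combines several of the options above and
shows a logical page with all three sections.  It assumes a @sc{gnu}
@code{nl} and a POSIX shell; the variable names and the sample text are
invented for this example.

```shell
# Number every line (-b a), zero-padded to width 3 (-n rz -w 3),
# with ": " between the number and the text (-s).
padded=$(printf 'alpha\n\nbeta\n' | nl -b a -n rz -w 3 -s ': ')

# Start at 10 (-v), step by 5 (-i), left-justify the number (-n ln).
stepped=$(printf 'x\ny\n' | nl -v 10 -i 5 -n ln -w 2 -s ' ')

# A logical page with header, body, and footer delimiters.  With the
# default styles only the body line is numbered.
page=$(printf '%s\n' '\:\:\:' 'title' '\:\:' 'text' '\:' 'end' | nl)

printf '%s\n' "$padded" "$stepped" "$page"
```

In the last pipeline each delimiter line is replaced by an empty line on
output, and only @samp{text}, the single body line, receives a number.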
 584
 585
 586 @node od invocation
 587 @section @code{od}: Write files in octal or other formats
 588
 589 @pindex od
 590 @cindex octal dump of files
 591 @cindex hex dump of files
 592 @cindex ASCII dump of files
 593 @cindex file contents, dumping unambiguously
 594
 595 @code{od} writes an unambiguous representation of each @var{file}
 596 (@samp{-} means standard input), or standard input if none are given.
 597 Synopsis:
 598
 599 @example
 600 od [@var{option}]@dots{} [@var{file}]@dots{}
 601 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
 602 @end example
 603
 604 Each line of output consists of the offset in the input, followed by
 605 groups of data from the file. By default, @code{od} prints the offset in
 606 octal, and each group of file data is two bytes of input printed as a
 607 single octal number.
 608
 609 The program accepts the following options.  Also see @ref{Common options}.
 610
 611 @table @samp
 612
 613 @item -A @var{radix}
 614 @itemx --address-radix=@var{radix}
 615 @opindex -A
 616 @opindex --address-radix
 617 @cindex radix for file offsets
 618 @cindex file offset radix
 619 Select the base in which file offsets are printed.  @var{radix} can
 620 be one of the following:
 621
 622 @table @samp
 623 @item d
 624 decimal;
 625 @item o
 626 octal;
 627 @item x
 628 hexadecimal;
 629 @item n
 630 none (do not print offsets).
 631 @end table
 632
 633 The default is octal.
 634
 635 @item -j @var{bytes}
 636 @itemx --skip-bytes=@var{bytes}
 637 @opindex -j
 638 @opindex --skip-bytes
 639 Skip @var{bytes} input bytes before formatting and writing.  If
 640 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
 641 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
 642 in decimal.  Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
 643 by 1024, and @samp{m} by 1048576.
 644
 645 @item -N @var{bytes}
 646 @itemx --read-bytes=@var{bytes}
 647 @opindex -N
 648 @opindex --read-bytes
  649 Output at most @var{bytes} bytes of the input.  Prefixes and suffixes on
  650 @var{bytes} are interpreted as for the @samp{-j} option.
 651
 652 @item -s [@var{n}]
 653 @itemx --strings[=@var{n}]
 654 @opindex -s
 655 @opindex --strings
 656 @cindex string constants, outputting
 657 Instead of the normal output, output only @dfn{string constants}: at
 658 least @var{n} (3 by default) consecutive @sc{ascii} graphic characters,
 659 followed by a null (zero) byte.
 660
 661 @item -t @var{type}
 662 @itemx --format=@var{type}
 663 @opindex -t
 664 @opindex --format
 665 Select the format in which to output the file data.  @var{type} is a
 666 string of one or more of the below type indicator characters.  If you
 667 include more than one type indicator character in a single @var{type}
 668 string, or use this option more than once, @code{od} writes one copy
 669 of each output line using each of the data types that you specified,
 670 in the order that you specified.
 671
 672 Adding a trailing ``z'' to any type specification appends a display
 673 of the @sc{ascii} character representation of the printable characters
 674 to the output line generated by the type specification.
 675
 676 @table @samp
 677 @item a
 678 named character
 679 @item c
  680 @sc{ascii} character or backslash escape
 681 @item d
 682 signed decimal
 683 @item f
 684 floating point
 685 @item o
 686 octal
 687 @item u
 688 unsigned decimal
 689 @item x
 690 hexadecimal
 691 @end table
 692
  693 The type @samp{a} outputs things like @samp{sp} for space, @samp{nl} for
  694 newline, and @samp{nul} for a null (zero) byte.  Type @samp{c} outputs
  695 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
 696
 697 @cindex type size
 698 Except for types @samp{a} and @samp{c}, you can specify the number
 699 of bytes to use in interpreting each number in the given data type
 700 by following the type indicator character with a decimal integer.
  701 Alternatively, you can specify the size of one of the C compiler's
 702 built-in data types by following the type indicator character with
 703 one of the following characters.  For integers (@samp{d}, @samp{o},
 704 @samp{u}, @samp{x}):
 705
 706 @table @samp
 707 @item C
 708 char
 709 @item S
 710 short
 711 @item I
 712 int
 713 @item L
 714 long
 715 @end table
 716
 717 For floating point (@code{f}):
 718
 719 @table @asis
 720 @item F
 721 float
 722 @item D
 723 double
 724 @item L
 725 long double
 726 @end table
 727
 728 @item -v
 729 @itemx --output-duplicates
 730 @opindex -v
 731 @opindex --output-duplicates
 732 Output consecutive lines that are identical.  By default, when two or
 733 more consecutive output lines would be identical, @code{od} outputs only
 734 the first line, and puts just an asterisk on the following line to
 735 indicate the elision.
 736
 737 @item -w[@var{n}]
 738 @itemx --width[=@var{n}]
 739 @opindex -w
 740 @opindex --width
  741 Dump @var{n} input bytes per output line.  This must be a multiple of
 742 the least common multiple of the sizes associated with the specified
 743 output types.  If @var{n} is omitted, the default is 32.  If this option
 744 is not given at all, the default is 16.
 745
 746 @end table
 747
 748 The next several options map the old, pre-@sc{posix} format specification
 749 options to the corresponding @sc{posix} format specs.
 750 @sc{gnu} @code{od} accepts
 751 any combination of old- and new-style options.  Format specification
 752 options accumulate.
 753
 754 @table @samp
 755
 756 @item -a
 757 @opindex -a
 758 Output as named characters.  Equivalent to @samp{-ta}.
 759
 760 @item -b
 761 @opindex -b
 762 Output as octal bytes.  Equivalent to @samp{-toC}.
 763
 764 @item -c
 765 @opindex -c
 766 Output as @sc{ascii} characters or backslash escapes.  Equivalent to
 767 @samp{-tc}.
 768
 769 @item -d
 770 @opindex -d
 771 Output as unsigned decimal shorts.  Equivalent to @samp{-tu2}.
 772
 773 @item -f
 774 @opindex -f
 775 Output as floats.  Equivalent to @samp{-tfF}.
 776
 777 @item -h
 778 @opindex -h
 779 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 780
 781 @item -i
 782 @opindex -i
 783 Output as decimal shorts.  Equivalent to @samp{-td2}.
 784
 785 @item -l
 786 @opindex -l
 787 Output as decimal longs.  Equivalent to @samp{-td4}.
 788
 789 @item -o
 790 @opindex -o
 791 Output as octal shorts.  Equivalent to @samp{-to2}.
 792
 793 @item -x
 794 @opindex -x
 795 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 796
 797 @item -C
 798 @itemx --traditional
 799 @opindex --traditional
 800 Recognize the pre-@sc{posix} non-option arguments that traditional @code{od}
 801 accepted.  The following syntax:
 802
 803 @smallexample
 804 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
 805 @end smallexample
 806
 807 @noindent
 808 can be used to specify at most one file and optional arguments
 809 specifying an offset and a pseudo-start address, @var{label}.  By
 810 default, @var{offset} is interpreted as an octal number specifying how
 811 many input bytes to skip before formatting and writing.  The optional
 812 trailing decimal point forces the interpretation of @var{offset} as a
  813 decimal number.  If no decimal point is specified and the offset begins with
  814 @samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number.  If
 815 there is a trailing @samp{b}, the number of bytes skipped will be
 816 @var{offset} multiplied by 512.  The @var{label} argument is interpreted
 817 just like @var{offset}, but it specifies an initial pseudo-address.  The
 818 pseudo-addresses are displayed in parentheses following any normal
 819 address.
 820
 821 @end table
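
A few short pipelines illustrate the format, offset, and byte-count
options described in this section.  The sketch assumes a @sc{gnu}
@code{od} and a POSIX shell; the variable names and the sample bytes are
invented for this example.

```shell
# Hexadecimal bytes (-t x1) with offsets suppressed (-A n).
hex=$(printf 'AB\n' | od -A n -t x1)

# The same input as named characters (-t a).
named=$(printf 'AB\n' | od -A n -t a)

# Skip one byte (-j 1), then dump at most one byte (-N 1).
middle=$(printf 'ABC' | od -A n -j 1 -N 1 -t x1)

# Old-style -b is equivalent to -t oC (octal bytes).
old=$(printf 'AB' | od -A n -b)
new=$(printf 'AB' | od -A n -t oC)

printf '%s\n' "$hex" "$named" "$middle" "$old" "$new"
```

One-byte types such as @samp{x1} and @samp{oC} are used here because
their output does not depend on the byte order of the machine, unlike
the default two-byte octal format.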
 822
 823
 824 @node Formatting file contents
 825 @chapter Formatting file contents
 826
 827 @cindex formatting file contents
 828
 829 These commands reformat the contents of files.
 830
 831 @menu
 832 * fmt invocation::              Reformat paragraph text.
 833 * pr invocation::               Paginate or columnate files for printing.
 834 * fold invocation::             Wrap input lines to fit in specified width.
 835 @end menu
 836
 837
 838 @node fmt invocation
 839 @section @code{fmt}: Reformat paragraph text
 840
 841 @pindex fmt
 842 @cindex reformatting paragraph text
 843 @cindex paragraphs, reformatting
 844 @cindex text, reformatting
 845
 846 @code{fmt} fills and joins lines to produce output lines of (at most)
 847 a given number of characters (75 by default).  Synopsis:
 848
 849 @example
 850 fmt [@var{option}]@dots{} [@var{file}]@dots{}
 851 @end example
 852
 853 @code{fmt} reads from the specified @var{file} arguments (or standard
 854 input if none are given), and writes to standard output.
 855
 856 By default, blank lines, spaces between words, and indentation are
 857 preserved in the output; successive input lines with different
 858 indentation are not joined; tabs are expanded on input and introduced on
 859 output.
 860
 861 @cindex line-breaking
 862 @cindex sentences and line-breaking
 863 @cindex Knuth, Donald E.
 864 @cindex Plass, Michael F.
 865 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
 866 avoid line breaks after the first word of a sentence or before the last
 867 word of a sentence.  A @dfn{sentence break} is defined as either the end
 868 of a paragraph or a word ending in any of @samp{.?!}, followed by two
 869 spaces or end of line, ignoring any intervening parentheses or quotes.
 870 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
 871 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
 872 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
 873 and Experience}, 11 (1981), 1119--1184).
 874
 875 The program accepts the following options.  Also see @ref{Common options}.
 876
 877 @table @samp
 878
 879 @item -c
 880 @itemx --crown-margin
 881 @opindex -c
 882 @opindex --crown-margin
 883 @cindex crown margin
 884 @dfn{Crown margin} mode: preserve the indentation of the first two
 885 lines within a paragraph, and align the left margin of each subsequent
 886 line with that of the second line.
 887
 888 @item -t
 889 @itemx --tagged-paragraph
 890 @opindex -t
 891 @opindex --tagged-paragraph
 892 @cindex tagged paragraphs
 893 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
 894 indentation of the first line of a paragraph is the same as the
 895 indentation of the second, the first line is treated as a one-line
 896 paragraph.
 897
 898 @item -s
 899 @itemx --split-only
 900 @opindex -s
 901 @opindex --split-only
 902 Split lines only.  Do not join short lines to form longer ones.  This
 903 prevents sample lines of code, and other such ``formatted'' text from
 904 being unduly combined.
 905
 906 @item -u
 907 @itemx --uniform-spacing
 908 @opindex -u
 909 @opindex --uniform-spacing
 910 Uniform spacing.  Reduce spacing between words to one space, and spacing
 911 between sentences to two spaces.
 912
 913 @item -@var{width}
 914 @itemx -w @var{width}
 915 @itemx --width=@var{width}
 916 @opindex -@var{width}
 917 @opindex -w
 918 @opindex --width
 919 Fill output lines up to @var{width} characters (default 75).  @code{fmt}
 920 initially tries to make lines about 7% shorter than this, to give it
 921 room to balance line lengths.
 922
 923 @item -p @var{prefix}
 924 @itemx --prefix=@var{prefix}
@opindex -p
@opindex --prefix
 925 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
 926 are subject to formatting.  The prefix and any preceding whitespace are
 927 stripped for the formatting and then re-attached to each formatted output
 928 line.  One use is to format certain kinds of program comments, while
 929 leaving the code unchanged.
 930
 931 @end table
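
As a brief illustration of the options above (a sketch only; the sample
text is invented for this example), the first command below refills a
long line to a width of 20 columns, and the second refills only the
lines that begin with the prefix @samp{#}, leaving the line of code
as it is:

@example
# Refill one long line so that no output line exceeds 20 columns.
printf 'one two three four five six seven eight nine ten\n' | fmt -w 20

# Refill only the comment lines; the assignment is not reformatted.
printf '# a short\n# comment\nx = 1\n' | fmt -p '#'
@end example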
 932
 933
 934 @node pr invocation
 935 @section @code{pr}: Paginate or columnate files for printing
 936
 937 @pindex pr
 938 @cindex printing, preparing files for
 939 @cindex multicolumn output, generating
 940 @cindex merging files in parallel
 941
 942 @code{pr} writes each @var{file} (@samp{-} means standard input), or
 943 standard input if none are given, to standard output, paginating and
 944 optionally outputting in multicolumn format; optionally merges all
 945 @var{file}s, printing all in parallel, one per column.  Synopsis:
 946
 947 @example
 948 pr [@var{option}]@dots{} [@var{file}]@dots{}
 949 @end example
 950
 951 By default, a 5-line header is printed at each page: two blank lines;
 952 a line with the date, the filename, and the page count; and two more
 953 blank lines.  A footer of five blank lines is also printed.  With the @samp{-F}
 954 option, a 3-line header is printed: the leading two blank lines are
 955 omitted; no footer is used.  The default @var{page_length} in both cases is 66
 956 lines.  The default number of text lines changes from 56 (without @samp{-F})
 957 to 63 (with @samp{-F}).  The text line of the header takes up the full
 958 @var{page_width} in the form @samp{yyyy-mm-dd HH:MM string Page nnnn}.
 959 Here, @samp{string} is a centered header string.
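
The line counts above follow from simple arithmetic, sketched here with
shell arithmetic expansion (an illustration of the stated defaults, not
output from @code{pr} itself):

@example
# Default: 66-line page minus a 5-line header and a 5-line footer.
echo $((66 - 5 - 5))

# With -F: 66-line page minus a 3-line header, with no footer.
echo $((66 - 3))
@end example

@noindent
The two commands print 56 and 63, the numbers of text lines given above.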
 960
 961 Form feeds in the input cause page breaks in the output.  Multiple form
 962 feeds produce empty pages.
 963
 964 Columns are of equal width, separated by an optional string (default
 965 is @samp{space}).  For multicolumn output, lines will always be truncated to
 966 @var{page_width} (default 72), unless you use the @samp{-J} option.  For single
 967 column output no line truncation occurs by default.  Use @samp{-W} option to
 968 truncate lines in that case.
 969
 970 The following changes were made in version 1.22i and apply to later
 971 versions of @command{pr}:
 972 @c FIXME: this whole section here sounds very awkward to me. I
 973 @c made a few small changes, but really it all needs to be redone. - Brian
 974 @c OK, I fixed another sentence or two, but some of it I just don't understand.
 975 @c - Brian
 976 @itemize @bullet
 977
 978 @item
 979 Some small-letter options (@samp{-s}, @samp{-w}) have been
 980 redefined for better @sc{posix} compliance.  The output of some further
 981 cases has been adapted to other Unix systems.  These changes are not
 982 compatible with earlier versions of the program.
 983
 984 @item
 985 Some new capital-letter options (@samp{-J}, @samp{-S}, @samp{-W})
 986 have been introduced to turn off unexpected interference from small-letter
 987 options.  The @samp{-N} option and the second argument @var{last_page}
 988 of @samp{+FIRST_PAGE} offer more flexibility.  The detailed handling of
 989 form feeds set in the input files requires the @samp{-T} option.
 990
 991 @item
 992 Capital letter options override small letter ones.
 993
 994 @item
 995 Some of the option-arguments (compare @samp{-s}, @samp{-S}, @samp{-e},
 996 @samp{-i}, @samp{-n}) cannot be specified as separate arguments from the
 997 preceding option letter (already stated in the @sc{posix} specification).
 998 @end itemize
 999
1000 The program accepts the following options.  Also see @ref{Common options}.
1001
1002 @table @samp
1003
1004 @item +@var{first_page}[:@var{last_page}]
1005 @itemx --pages=@var{first_page}[:@var{last_page}]
1006 @opindex +@var{first_page}[:@var{last_page}]
1007 @opindex --pages
1008 Begin printing with page @var{first_page} and stop with @var{last_page}.
1009 A missing @samp{:@var{last_page}} implies end of file.  While counting
1010 the pages to skip, each form feed in the input file results
1011 in a new page.  Page counting with and without @samp{+@var{first_page}}
1012 is identical.  By default, counting starts with the first page of the input
1013 file (not the first page printed).  Line numbering may be altered by the @samp{-N}
1014 option.
1015
1016 @item -@var{column}
1017 @itemx --columns=@var{column}
1018 @opindex -@var{column}
1019 @opindex --columns
1020 @cindex down columns
1021 With each single @var{file}, produce @var{column} columns of output
1022 (default is 1) and print columns down, unless @samp{-a} is used.  The
1023 column width is automatically decreased as @var{column} increases, unless
1024 you use the @samp{-W/-w} option to increase @var{page_width} as well.
1025 This option might well cause some lines to be truncated.  The number of
1026 lines in the columns on each page is balanced.  The options @samp{-e}
1027 and @samp{-i} are on for multiple text-column output.  Together with the
1028 @samp{-J} option, column alignment and line truncation are turned off.
1029 Lines of full length are joined in a free field format and the @samp{-S}
1030 option may set field separators.  @samp{-@var{column}} may not be used
1031 with the @samp{-m} option.
1032
1033 @item -a
1034 @itemx --across
1035 @opindex -a
1036 @opindex --across
1037 @cindex across columns
1038 With each single @var{file}, print columns across rather than down.  The
1039 @samp{-@var{column}} option must be given with @var{column} greater than one.
1040 If a line is too long to fit in a column, it is truncated.
1041
1042 @item -c
1043 @itemx --show-control-chars
1044 @opindex -c
1045 @opindex --show-control-chars
1046 Print control characters using hat notation (e.g., @samp{^G}); print
1047 other unprintable characters in octal backslash notation.  By default,
1048 unprintable characters are not changed.
1049
1050 @item -d
1051 @itemx --double-space
1052 @opindex -d
1053 @opindex --double-space
1054 @cindex double spacing
1055 Double space the output.
1056
1057 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
1058 @itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
1059 @opindex -e
1060 @opindex --expand-tabs
1061 @cindex input tabs
1062 Expand @var{tab}s to spaces on input.  Optional argument @var{in-tabchar} is
1063 the input tab character (default is the TAB character).  Second optional
1064 argument @var{in-tabwidth} is the input tab character's width (default
1065 is 8).
1066
1067 @item -f
1068 @itemx -F
1069 @itemx --form-feed
1070 @opindex -F
1071 @opindex -f
1072 @opindex --form-feed
1073 Use a form feed instead of newlines to separate output pages.  The default
1074 page length of 66 lines is not altered.  But the number of lines of text
1075 per page changes from default 56 to 63 lines.
1076
1077 @item -h @var{HEADER}
1078 @itemx --header=@var{HEADER}
1079 @opindex -h
1080 @opindex --header
1081 Replace the filename in the header with the centered string @var{HEADER}.
1082 Left-hand-side truncation (marked by a @samp{*}) may occur if the total
1083 header line @samp{yyyy-mm-dd HH:MM HEADER Page nnnn} becomes larger than
1084 @var{page_width}.  @samp{-h ""} prints a blank line header.  Don't use
1085 @samp{-h""}.
1086 A space between the @samp{-h} option and the argument is always
1087 required.
1088
1089 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
1090 @itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
1091 @opindex -i
1092 @opindex --output-tabs
1093 @cindex output tabs
1094 Replace spaces with @var{tab}s on output.  Optional argument @var{out-tabchar}
1095 is the output tab character (default is the TAB character).  Second optional
1096 argument @var{out-tabwidth} is the output tab character's width (default
1097 is 8).
1098
1099 @item -J
1100 @itemx --join-lines
1101 @opindex -J
1102 @opindex --join-lines
1103 Merge lines of full length.  Used together with the column options
1104 @samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}.  Turns off
1105 @samp{-W/-w} line truncation;
1106 no column alignment used; may be used with @samp{-S[@var{string}]}.
1107 @samp{-J} has been introduced (together with @samp{-W} and @samp{-S})
1108 to disentangle the old (@sc{posix}-compliant) options @samp{-w} and
1109 @samp{-s} along with the three column options.
1110
1111
1112 @item -l @var{page_length}
1113 @itemx --length=@var{page_length}
1114 @opindex -l
1115 @opindex --length
1116 Set the page length to @var{page_length} (default 66) lines, including
1117 the lines of the header [and the footer].  If @var{page_length} is less
1118 than or equal to 10 (or <= 3 with @samp{-F}), the header and footer are
1119 omitted, and all form feeds set in input files are eliminated, as if
1120 the @samp{-T} option had been given.
1121
1122 @item -m
1123 @itemx --merge
1124 @opindex -m
1125 @opindex --merge
1126 Merge and print all @var{file}s in parallel, one in each column.  If a
1127 line is too long to fit in a column, it is truncated, unless the @samp{-J}
1128 option is used.  @samp{-S[@var{string}]} may be used.  Empty pages in
1129 some @var{file}s (form feeds set) produce empty columns, still marked
1130 by @var{string}.  The result is a continuous line numbering and column
1131 marking throughout the whole merged file.  Completely empty merged pages
1132 show no separators or line numbers.  The default header becomes
1133 @samp{yyyy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
1134 @samp{-h @var{header}} to fill up the middle blank part.
1135
1136 @item -n[@var{number-separator}[@var{digits}]]
1137 @itemx --number-lines[=@var{number-separator}[@var{digits}]]
1138 @opindex -n
1139 @opindex --number-lines
1140 Provide @var{digits} digit line numbering (default for @var{digits} is
1141 5).  With multicolumn output the number occupies the first @var{digits}
1142 column positions of each text column or only each line of @samp{-m}
1143 output.  With single column output the number precedes each line just as
1144 @samp{-m} does.  Default counting of the line numbers starts with the
1145 first line of the input file (not the first line printed, compare the
1146 @samp{--pages} option and @samp{-N} option).
1147 Optional argument @var{number-separator} is the character appended to
1148 the line number to separate it from the text that follows.  The default
1149 separator is the TAB character.  Strictly speaking, a TAB is printed
1150 only with single column output.  The TAB width varies
1151 with the TAB position, e.g., with the left @var{margin} specified
1152 by the @samp{-o} option.  With multicolumn output, priority is given to
1153 @samp{equal width of output columns} (a @sc{posix} specification).
1154 The TAB width is fixed to the value of the first column and does
1155 not change with different values of left @var{margin}.  That means a
1156 fixed number of spaces is always printed in place of the
1157 @var{number-separator} TAB.  The tabification depends upon the output
1158 position.
1159
1160 @item -N @var{line_number}
1161 @itemx --first-line-number=@var{line_number}
1162 @opindex -N
1163 @opindex --first-line-number
1164 Start line counting with the number @var{line_number} at the first line of
1165 the first page printed (in most cases not the first line of the input file).
1166
1167 @item -o @var{margin}
1168 @itemx --indent=@var{margin}
1169 @opindex -o
1170 @opindex --indent
1171 @cindex indenting lines
1172 @cindex left margin
1173 Indent each line with a margin @var{margin} spaces wide (default is zero).
1174 The total page width is the size of the margin plus the @var{page_width}
1175 set with the @samp{-W/-w} option.  A limited overflow may occur with
1176 numbered single column output (compare @samp{-n} option).
1177
1178 @item -r
1179 @itemx --no-file-warnings
1180 @opindex -r
1181 @opindex --no-file-warnings
1182 Do not print a warning message when an argument @var{file} cannot be
1183 opened.  (The exit status will still be nonzero, however.)
1184
1185 @item -s[@var{char}]
1186 @itemx --separator[=@var{char}]
1187 @opindex -s
1188 @opindex --separator
1189 Separate columns by a single character @var{char}.  The default for
1190 @var{char} is the TAB character without @samp{-w} and @samp{no
1191 character} with @samp{-w}.  Without @samp{-s} the default separator
1192 @samp{space} is set.  @samp{-s[char]} turns off line truncation of all
1193 three column options (@samp{-COLUMN}|@samp{-a -COLUMN}|@samp{-m}) unless
1194 @samp{-w} is set.  This is a @sc{posix}-compliant formulation.
1195
1196
1197 @item -S[@var{string}]
1198 @itemx --sep-string[=@var{string}]
1199 @opindex -S
1200 @opindex --sep-string
1201 Use @var{string} to separate output columns.  The @samp{-S} option doesn't
1202 affect the @samp{-W/-w} option, unlike the @samp{-s} option which does.  It
1203 does not affect line truncation or column alignment.
1204 Without @samp{-S}, and with @samp{-J}, @code{pr} uses the default output
1205 separator, TAB.
1206 Without @samp{-S} or @samp{-J}, @code{pr} uses a @samp{space}
1207 (same as @samp{-S" "}).
1208 Using @samp{-S} with no @var{string} is equivalent to @samp{-S""}.
1209 Note that for some of @code{pr}'s options the single-letter option
1210 character must be followed immediately by any corresponding argument;
1211 there may not be any intervening white space.
1212 @samp{-S/-s} is one of them.  Don't use @samp{-S "STRING"}.
1213 @sc{posix} requires this.
1214
1215 @item -t
1216 @itemx --omit-header
1217 @opindex -t
1218 @opindex --omit-header
1219 Do not print the usual header [and footer] on each page, and do not fill
1220 out the bottom of pages (with blank lines or a form feed).  No page
1221 structure is produced, but form feeds set in the input files are retained.
1222 The predefined pagination is not changed.  @samp{-t} or @samp{-T} may be
1223 useful together with other options; e.g.: @samp{-t -e4}, expand TAB characters
1224 in the input file to 4 spaces but don't make any other changes.  Use of
1225 @samp{-t} overrides @samp{-h}.
1226
1227 @item -T
1228 @itemx --omit-pagination
1229 @opindex -T
1230 @opindex --omit-pagination
1231 Do not print header [and footer].  In addition eliminate all form feeds
1232 set in the input files.
1233
1234 @item -v
1235 @itemx --show-nonprinting
1236 @opindex -v
1237 @opindex --show-nonprinting
1238 Print unprintable characters in octal backslash notation.
1239
1240 @item -w @var{page_width}
1241 @itemx --width=@var{page_width}
1242 @opindex -w
1243 @opindex --width
1244 Set page width to @var{page_width} characters for multiple text-column
1245 output only (default for @var{page_width} is 72).  @samp{-s[CHAR]} turns
1246 off the default page width and any line truncation and column alignment.
1247 Lines of full length are merged, regardless of the column options
1248 set.  No @var{page_width} setting is possible with single column output.
1249 A @sc{posix}-compliant formulation.
1250
1251 @item -W @var{page_width}
1252 @itemx --page_width=@var{page_width}
1253 @opindex -W
1254 @opindex --page_width
1255 Set the page width to @var{page_width} characters.  That's valid with and
1256 without a column option.  Text lines are truncated, unless @samp{-J}
1257 is used.  Together with one of the three column options
1258 (@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}) column
1259 alignment is always used.  The separator options @samp{-S} or @samp{-s}
1260 don't affect the @samp{-W} option.  Default is 72 characters.  Without
1261 @samp{-W @var{page_width}} and without any of the column options no line
1262 truncation is used (defined to keep backward compatibility and to meet
1263 most frequent tasks).  That's equivalent to @samp{-W 72 -J}.  With and
1264 without @samp{-W @var{page_width}} the header line is always truncated
1265 to avoid line overflow.
1266
1267 @end table
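
A few typical invocations, offered as a sketch (the input here is
generated with @code{printf}, and the header text @samp{Draft} is an
arbitrary choice made for this example):

@example
# Two columns, filled downwards, without header or footer.
printf '1\n2\n3\n4\n' | pr -2 -t

# The same input with the columns filled across instead.
printf '1\n2\n3\n4\n' | pr -2 -a -t

# A paginated listing with a custom header and a 20-line page.
printf 'alpha\nbeta\n' | pr -h Draft -l 20
@end example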
1268
1269
1270 @node fold invocation
1271 @section @code{fold}: Wrap input lines to fit in specified width
1272
1273 @pindex fold
1274 @cindex wrapping long input lines
1275 @cindex folding long input lines
1276
1277 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1278 standard input if none are given, to standard output, breaking long
1279 lines.  Synopsis:
1280
1281 @example
1282 fold [@var{option}]@dots{} [@var{file}]@dots{}
1283 @end example
1284
1285 By default, @code{fold} breaks lines wider than 80 columns.  The output
1286 is split into as many lines as necessary.
1287
1288 @cindex screen columns
1289 @code{fold} counts screen columns by default; thus, a tab may count more
1290 than one column, backspace decreases the column count, and carriage
1291 return sets the column to zero.
1292
1293 The program accepts the following options.  Also see @ref{Common options}.
1294
1295 @table @samp
1296
1297 @item -b
1298 @itemx --bytes
1299 @opindex -b
1300 @opindex --bytes
1301 Count bytes rather than columns, so that tabs, backspaces, and carriage
1302 returns are each counted as taking up one column, just like other
1303 characters.
1304
1305 @item -s
1306 @itemx --spaces
1307 @opindex -s
1308 @opindex --spaces
1309 Break at word boundaries: the line is broken after the last blank before
1310 the maximum line length.  If the line contains no such blanks, the line
1311 is broken at the maximum line length as usual.
1312
1313 @item -w @var{width}
1314 @itemx --width=@var{width}
1315 @opindex -w
1316 @opindex --width
1317 Use a maximum line length of @var{width} columns instead of 80.
1318
1319 @end table
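
The difference between the default behavior and @samp{-s} can be seen
in a small sketch (the input strings are invented for this example):

@example
# Break strictly at 5 columns: prints "abcde", then "fghij".
printf 'abcdefghij\n' | fold -w 5

# Break after a blank where possible, so no word is cut in the middle.
printf 'aaaa bbbb cccc\n' | fold -s -w 10
@end example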
1320
1321
1322 @node Output of parts of files
1323 @chapter Output of parts of files
1324
1325 @cindex output of parts of files
1326 @cindex parts of files, output of
1327
1328 These commands output pieces of the input.
1329
1330 @menu
1331 * head invocation::             Output the first part of files.
1332 * tail invocation::             Output the last part of files.
1333 * split invocation::            Split a file into fixed-size pieces.
1334 * csplit invocation::           Split a file into context-determined pieces.
1335 @end menu
1336
1337 @node head invocation
1338 @section @code{head}: Output the first part of files
1339
1340 @pindex head
1341 @cindex initial part of files, outputting
1342 @cindex first part of files, outputting
1343
1344 @code{head} prints the first part (10 lines by default) of each
1345 @var{file}; it reads from standard input if no files are given or
1346 when given a @var{file} of @samp{-}.  Synopses:
1347
1348 @example
1349 head [@var{option}]@dots{} [@var{file}]@dots{}
1350 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1351 @end example
1352
1353 If more than one @var{file} is specified, @code{head} prints a
1354 one-line header consisting of
1355 @example
1356 ==> @var{file name} <==
1357 @end example
1358 @noindent
1359 before the output for each @var{file}.
1360
1361 @code{head} accepts two option formats: the new one, in which numbers
1362 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1363 the number precedes any option letters (@samp{-1q}).
1364
1365 The program accepts the following options.  Also see @ref{Common options}.
1366
1367 @table @samp
1368
1369 @item -@var{count}@var{options}
1370 @opindex -@var{count}
1371 This option is only recognized if it is specified first.  @var{count} is
1372 a decimal number optionally followed by a size letter (@samp{b},
1373 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1374 or other option letters (@samp{cqv}).
1375
1376 @item -c @var{bytes}
1377 @itemx --bytes=@var{bytes}
1378 @opindex -c
1379 @opindex --bytes
1380 Print the first @var{bytes} bytes, instead of initial lines.  Appending
1381 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1382 by 1048576.
1383
1384 @item -n @var{n}
1385 @itemx --lines=@var{n}
1386 @opindex -n
1387 @opindex --lines
1388 Output the first @var{n} lines.
1389
1390 @item -q
1391 @itemx --quiet
1392 @itemx --silent
1393 @opindex -q
1394 @opindex --quiet
1395 @opindex --silent
1396 Never print file name headers.
1397
1398 @item -v
1399 @itemx --verbose
1400 @opindex -v
1401 @opindex --verbose
1402 Always print file name headers.
1403
1404 @end table
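
For instance, counting by lines and by bytes might look like this (a
minimal sketch with made-up input):

@example
# Print the first two lines: "a" and "b".
printf 'a\nb\nc\n' | head -n 2

# Print the first five bytes: "abcde", with no trailing newline.
printf 'abcdefgh\n' | head -c 5
@end example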
1405
1406
1407 @node tail invocation
1408 @section @code{tail}: Output the last part of files
1409
1410 @pindex tail
1411 @cindex last part of files, outputting
1412
1413 @code{tail} prints the last part (10 lines by default) of each
1414 @var{file}; it reads from standard input if no files are given or
1415 when given a @var{file} of @samp{-}.  Synopses:
1416
1417 @example
1418 tail [@var{option}]@dots{} [@var{file}]@dots{}
1419 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1420 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1421 @end example
1422
1423 If more than one @var{file} is specified, @code{tail} prints a
1424 one-line header consisting of
1425 @example
1426 ==> @var{file name} <==
1427 @end example
1428 @noindent
1429 before the output for each @var{file}.
1430
1431 @cindex BSD @code{tail}
1432 @sc{gnu} @code{tail} can output any amount of data (some other versions of
1433 @code{tail} cannot).  It also has no @samp{-r} option (print in
1434 reverse), since reversing a file is really a different job from printing
1435 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1436 only reverse files that are at most as large as its buffer, which is
1437 typically 32k.  A more reliable and versatile way to reverse files is
1438 the @sc{gnu} @code{tac} command.
1439
1440 @code{tail} accepts two option formats: the new one, in which numbers
1441 are arguments to the options (@samp{-n 1}), and the old one, in which
1442 the number precedes any option letters (@samp{-1} or @samp{+1}).
1443
1444 If any option-argument is a number @var{n} starting with a @samp{+},
1445 @code{tail} begins printing with the @var{n}th item from the start of
1446 each file, instead of from the end.
1447
1448 The program accepts the following options.  Also see @ref{Common options}.
1449
1450 @table @samp
1451
1452 @item -@var{count}
1453 @itemx +@var{count}
1454 @opindex -@var{count}
1455 @opindex +@var{count}
1456 This option is only recognized if it is specified first.  @var{count} is
1457 a decimal number optionally followed by a size letter (@samp{b},
1458 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1459 or other option letters (@samp{cfqv}).
1460
1461 @item -c @var{bytes}
1462 @itemx --bytes=@var{bytes}
1463 @opindex -c
1464 @opindex --bytes
1465 Output the last @var{bytes} bytes, instead of final lines.  Appending
1466 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1467 by 1048576.
1468
1469 @item -f
1470 @itemx --follow[=@var{how}]
1471 @opindex -f
1472 @opindex --follow
1473 @cindex growing files
1474 @vindex name @r{follow option}
1475 @vindex descriptor @r{follow option}
1476 Loop forever trying to read more characters at the end of the file,
1477 presumably because the file is growing.  This option is ignored when
1478 reading from a pipe.
1479 If more than one file is given, @code{tail} prints a header whenever it
1480 gets output from a different file, to indicate which file that output is
1481 from.
1482
1483 There are two ways to specify how you'd like to track files with this option,
1484 but that difference is noticeable only when a followed file is removed or
1485 renamed.
1486 If you'd like to continue to track the end of a growing file even after
1487 it has been unlinked, use @samp{--follow=descriptor}.  This is the default
1488 behavior, but it is not useful if you're tracking a log file that may be
1489 rotated (removed or renamed, then reopened).  In that case, use
1490 @samp{--follow=name} to track the named file by reopening it periodically
1491 to see if it has been removed and recreated by some other program.
1492
1493 No matter which method you use, if the tracked file is determined to have
1494 shrunk, @code{tail} prints a message saying the file has been truncated
1495 and resumes tracking the end of the file from the newly-determined endpoint.
1496
1497 When a file is removed, @code{tail}'s behavior depends on whether it is
1498 following the name or the descriptor.  When following by name, @code{tail} can
1499 detect that a file has been removed and gives a message to that effect,
1500 and if @samp{--retry} has been specified it will continue checking
1501 periodically to see if the file reappears.
1502 When following a descriptor, @code{tail} does not detect that the file has
1503 been unlinked or renamed and issues no message; even though the file
1504 may no longer be accessible via its original name, it may still be
1505 growing.
1506
1507 The option values @samp{descriptor} and @samp{name} may be specified only
1508 with the long form of the option, not with @samp{-f}.
1509
1510 @item --retry
1511 @opindex --retry
1512 This option is meaningful only when following by name.
1513 Without this option, when @code{tail} encounters a file that doesn't
1514 exist or is otherwise inaccessible, it reports that fact and
1515 never checks it again.
1516
1517 @item --sleep-interval=@var{n}
1518 @opindex --sleep-interval
1519 Change the number of seconds to wait between iterations (the default is 1).
1520 During one iteration, every specified file is checked to see if it has
1521 changed size.
1522
1523 @item --pid=@var{pid}
1524 @opindex --pid
1525 When following by name or by descriptor, you may specify the process ID,
1526 @var{pid}, of the sole writer of all @var{file} arguments.  Then, shortly
1527 after that process terminates, @code{tail} will also terminate.  This will
1528 work properly only if the writer and the tailing process are running on
1529 the same machine.  For example, to save the output of a build in a file
1530 and to watch the file grow, if you invoke @code{make} and @code{tail}
1531 like this then the tail process will stop when your build completes.
1532 Without this option, you would have had to kill the @code{tail -f}
1533 process yourself.
1534 @example
1535 $ make >& makerr & tail --pid=$! -f makerr
1536 @end example
1537 If you specify a @var{pid} that is not in use or that does not correspond
1538 to the process that is writing to the tailed files, then @code{tail}
1539 may terminate long before any @var{file}s stop growing or it may not
1540 terminate until long after the real writer has terminated.
1541 Note that @samp{--pid} cannot be supported on some systems; @code{tail}
1542 will print a warning if this is the case.
1543
1544 @item --max-unchanged-stats=@var{n}
1545 @opindex --max-unchanged-stats
1546 When tailing a file by name, if there have been @var{n} (default
1547 N=@value{DEFAULT_MAX_N_UNCHANGED_STATS_BETWEEN_OPENS}) consecutive
1548 iterations for which the size has remained the same, then
1549 @code{open}/@code{fstat} the file to determine if that file name is
1550 still associated with the same device/inode-number pair as before.
1551 When following a log file that is rotated, this is approximately the
1552 number of seconds between when tail prints the last pre-rotation lines
1553 and when it prints the lines that have accumulated in the new log file.
1554 This option is meaningful only when following by name.
1555
1556 @item -n @var{n}
1557 @itemx --lines=@var{n}
1558 @opindex -n
1559 @opindex --lines
1560 Output the last @var{n} lines.
1561
1562 @item -q
1563 @itemx --quiet
1564 @itemx --silent
1565 @opindex -q
1566 @opindex --quiet
1567 @opindex --silent
1568 Never print file name headers.
1569
1570 @item -v
1571 @itemx --verbose
1572 @opindex -v
1573 @opindex --verbose
1574 Always print file name headers.
1575
1576 @end table
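
The effect of a leading @samp{+} on the count can be sketched as
follows (the input is invented for this example):

@example
# Print the last two lines: "c" and "d".
printf 'a\nb\nc\nd\n' | tail -n 2

# Start at the second line from the top: "b", "c" and "d".
printf 'a\nb\nc\nd\n' | tail -n +2
@end example

@noindent
To keep following a log file across rotation, one would combine the
options described above, as in @samp{tail --follow=name --retry
@var{logfile}}, where @var{logfile} stands for the file to watch.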
1577
1578
1579 @node split invocation
1580 @section @code{split}: Split a file into fixed-size pieces
1581
1582 @pindex split
1583 @cindex splitting a file into pieces
1584 @cindex pieces, splitting a file into
1585
1586 @code{split} creates output files containing consecutive sections of
1587 @var{input} (standard input if none is given or @var{input} is
1588 @samp{-}).  Synopsis:
1589
1590 @example
1591 split [@var{option}] [@var{input} [@var{prefix}]]
1592 @end example
1593
1594 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1595 left over for the last section), into each output file.
1596
1597 @cindex output file name prefix
1598 The output files' names consist of @var{prefix} (@samp{x} by default)
1599 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1600 that concatenating the output files in sorted order by file name produces
1601 the original input file.  (If more than 676 output files are required,
1602 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
1603
1604 The program accepts the following options.  Also see @ref{Common options}.
1605
1606 @table @samp
1607
1608 @item -@var{lines}
1609 @itemx -l @var{lines}
1610 @itemx --lines=@var{lines}
1611 @opindex -l
1612 @opindex --lines
1613 Put @var{lines} lines of @var{input} into each output file.
1614
1615 @item -b @var{bytes}
1616 @itemx --bytes=@var{bytes}
1617 @opindex -b
1618 @opindex --bytes
1619 Put @var{bytes} bytes of @var{input} into each output file.
1620 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1621 @samp{m} by 1048576.
1622
1623 @item -C @var{bytes}
1624 @itemx --line-bytes=@var{bytes}
1625 @opindex -C
1626 @opindex --line-bytes
1627 Put into each output file as many complete lines of @var{input} as
1628 possible without exceeding @var{bytes} bytes.  For lines longer than
1629 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1630 less than @var{bytes} bytes of the line are left, then continue
1631 normally.  @var{bytes} has the same format as for the @samp{--bytes}
1632 option.
1633
1634 @item --verbose
1635 @opindex --verbose
1636 Write a diagnostic to standard error just before each output file is opened.
1637
1638 @end table
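
The naming scheme and the round trip back to the original can be
sketched as follows.  The file name @file{input}, the prefix
@samp{part.} and the scratch directory are assumptions made for this
example only:

@example
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d) && cd "$dir"
printf '1\n2\n3\n4\n5\n' > input

# Two lines per piece: creates part.aa, part.ab and part.ac.
split -l 2 input part.
ls part.*

# Concatenating the pieces in name order reproduces the input;
# cmp prints nothing when the two are identical.
cat part.aa part.ab part.ac | cmp - input
@end example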
1639
1640
1641 @node csplit invocation
1642 @section @code{csplit}: Split a file into context-determined pieces
1643
1644 @pindex csplit
1645 @cindex context splitting
1646 @cindex splitting a file into pieces by context
1647
1648 @code{csplit} creates zero or more output files containing sections of
1649 @var{input} (standard input if @var{input} is @samp{-}).  Synopsis:
1650
1651 @example
1652 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1653 @end example
1654
1655 The contents of the output files are determined by the @var{pattern}
1656 arguments, as detailed below.  An error occurs if a @var{pattern}
1657 argument refers to a nonexistent line of the input file (e.g., if no
1658 remaining line matches a given regular expression).  After every
1659 @var{pattern} has been matched, any remaining input is copied into one
1660 last output file.
1661
1662 By default, @code{csplit} prints the number of bytes written to each
1663 output file after it has been created.
1664
1665 The types of pattern arguments are:
1666
1667 @table @samp
1668
1669 @item @var{n}
1670 Create an output file containing the input up to but not including line
@var{n} (a positive integer).  If followed by a repeat count, also
create an output file containing the next @var{n} lines of the input
file once for each repeat.
1674
1675 @item /@var{regexp}/[@var{offset}]
1676 Create an output file containing the current line up to (but not
1677 including) the next line of the input file that contains a match for
1678 @var{regexp}.  The optional @var{offset} is a @samp{+} or @samp{-}
1679 followed by a positive integer.  If it is given, the input up to the
1680 matching line plus or minus @var{offset} is put into the output file,
1681 and the line after that begins the next section of input.
1682
1683 @item %@var{regexp}%[@var{offset}]
1684 Like the previous type, except that it does not create an output
1685 file, so that section of the input file is effectively ignored.
1686
1687 @item @{@var{repeat-count}@}
Repeat the previous pattern @var{repeat-count} additional
times.  @var{repeat-count} can be either a positive integer or an
asterisk, meaning repeat as many times as necessary until the input is
exhausted.
1692
1693 @end table
1694
1695 The output files' names consist of a prefix (@samp{xx} by default)
1696 followed by a suffix.  By default, the suffix is an ascending sequence
1697 of two-digit decimal numbers from @samp{00} to @samp{99}.  In any case,
1698 concatenating the output files in sorted order by filename produces the
1699 original input file.
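
The pattern types above might be combined as in the following sketch,
which assumes a small sample file named @file{notes.txt} (a placeholder)
and the default @samp{xx} prefix.  The @samp{-s} option, described
below, only suppresses the byte counts.

@example
# Five-line sample input (hypothetical content).
printf 'intro\n# one\na\n# two\nb\n' > notes.txt
# Pattern 3 ends the first piece before line 3; the regexp ends the
# second piece before the next line matching it; the rest goes to xx02.
csplit -s notes.txt 3 '/^# two/'
# xx00 holds lines 1-2, xx01 holds line 3, xx02 holds lines 4-5.
cat xx00 xx01 xx02 | cmp - notes.txt && echo identical
@end example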
1700
1701 By default, if @code{csplit} encounters an error or receives a hangup,
1702 interrupt, quit, or terminate signal, it removes any output files
1703 that it has created so far before it exits.
1704
1705 The program accepts the following options.  Also see @ref{Common options}.
1706
1707 @table @samp
1708
1709 @item -f @var{prefix}
1710 @itemx --prefix=@var{prefix}
1711 @opindex -f
1712 @opindex --prefix
1713 @cindex output file name prefix
1714 Use @var{prefix} as the output file name prefix.
1715
1716 @item -b @var{suffix}
1717 @itemx --suffix=@var{suffix}
1718 @opindex -b
1719 @opindex --suffix
1720 @cindex output file name suffix
1721 Use @var{suffix} as the output file name suffix.  When this option is
1722 specified, the suffix string must include exactly one
1723 @code{printf(3)}-style conversion specification, possibly including
format specification flags, a field width, a precision specification,
1725 or all of these kinds of modifiers.  The format letter must convert a
1726 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1727 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed.  The
1728 entire @var{suffix} is given (with the current output file number) to
1729 @code{sprintf(3)} to form the file name suffixes for each of the
1730 individual output files in turn.  If this option is used, the
1731 @samp{--digits} option is ignored.
1732
1733 @item -n @var{digits}
1734 @itemx --digits=@var{digits}
1735 @opindex -n
1736 @opindex --digits
1737 Use output file names containing numbers that are @var{digits} digits
1738 long instead of the default 2.
1739
1740 @item -k
1741 @itemx --keep-files
1742 @opindex -k
1743 @opindex --keep-files
1744 Do not remove output files when errors are encountered.
1745
1746 @item -z
1747 @itemx --elide-empty-files
1748 @opindex -z
1749 @opindex --elide-empty-files
1750 Suppress the generation of zero-length output files.  (In cases where
1751 the section delimiters of the input file are supposed to mark the first
1752 lines of each of the sections, the first output file will generally be a
1753 zero-length file unless you use this option.)  The output file sequence
1754 numbers always run consecutively starting from 0, even when this option
1755 is specified.
1756
1757 @item -s
1758 @itemx -q
1759 @itemx --silent
1760 @itemx --quiet
1761 @opindex -s
1762 @opindex -q
1763 @opindex --silent
1764 @opindex --quiet
1765 Do not print counts of output file sizes.
1766
1767 @end table
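
To show how @samp{-f} and @samp{-b} combine, here is a small sketch; the
prefix @samp{part-}, the suffix format @samp{%03d.txt}, and the input
name @file{in.txt} are arbitrary choices made for illustration.

@example
printf 'a\nb\nc\n' > in.txt
# Split before line 2 and before line 3, naming the pieces with a
# three-digit, zero-padded number and a fixed extension.
csplit -s -f part- -b '%03d.txt' in.txt 2 3
# Expected names: part-000.txt, part-001.txt, part-002.txt.
ls part-*
@end example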
1768
1769
1770 @node Summarizing files
1771 @chapter Summarizing files
1772
1773 @cindex summarizing files
1774
1775 These commands generate just a few numbers representing entire
1776 contents of files.
1777
1778 @menu
1779 * wc invocation::               Print byte, word, and line counts.
1780 * sum invocation::              Print checksum and block counts.
1781 * cksum invocation::            Print CRC checksum and byte counts.
1782 * md5sum invocation::           Print or check message-digests.
1783 @end menu
1784
1785
1786 @node wc invocation
1787 @section @code{wc}: Print byte, word, and line counts
1788
1789 @pindex wc
1790 @cindex byte count
1791 @cindex character count
1792 @cindex word count
1793 @cindex line count
1794
1795 @code{wc} counts the number of bytes, characters, whitespace-separated
1796 words, and newlines in each given @var{file}, or standard input if none
1797 are given or for a @var{file} of @samp{-}.  Synopsis:
1798
1799 @example
1800 wc [@var{option}]@dots{} [@var{file}]@dots{}
1801 @end example
1802
1803 @cindex total counts
1804 @vindex POSIXLY_CORRECT
1805 @code{wc} prints one line of counts for each file, and if the file was
1806 given as an argument, it prints the file name following the counts.  If
1807 more than one @var{file} is given, @code{wc} prints a final line
1808 containing the cumulative counts, with the file name @file{total}.  The
1809 counts are printed in this order: newlines, words, characters, bytes.
1810 By default, each count is output right-justified in a 7-byte field with
1811 one space between fields so that the numbers and file names line up nicely
1812 in columns.  However, @sc{posix} requires that there be exactly one space
1813 separating columns.  You can make @code{wc} use the @sc{posix}-mandated
1814 output format by setting the @env{POSIXLY_CORRECT} environment variable.
1815
By default, @code{wc} prints three counts: the newline, word, and byte
counts.  Options can specify that only certain counts be printed.
1818 Options do not undo others previously given, so
1819
1820 @example
1821 wc --bytes --words
1822 @end example
1823
1824 @noindent
1825 prints both the byte counts and the word counts.
1826
With the @samp{--max-line-length} option, @code{wc} prints the length
1828 of the longest line per file, and if there is more than one file it
1829 prints the maximum (not the sum) of those lengths.
1830
1831 The program accepts the following options.  Also see @ref{Common options}.
1832
1833 @table @samp
1834
1835 @item -c
1836 @itemx --bytes
1837 @opindex -c
1838 @opindex --bytes
1839 Print only the byte counts.
1840
1841 @item -m
1842 @itemx --chars
1843 @opindex -m
1844 @opindex --chars
1845 Print only the character counts.
1846
1847 @item -w
1848 @itemx --words
1849 @opindex -w
1850 @opindex --words
1851 Print only the word counts.
1852
1853 @item -l
1854 @itemx --lines
1855 @opindex -l
1856 @opindex --lines
1857 Print only the newline counts.
1858
1859 @item -L
1860 @itemx --max-line-length
1861 @opindex -L
1862 @opindex --max-line-length
1863 Print only the maximum line lengths.
1864
1865 @end table
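
A brief sketch of how the selected counts appear.  The input is supplied
on standard input, so no file name is printed.  Column padding may vary
between versions, so the comments note only the values.

@example
# Two newlines, three words, fourteen bytes, printed in that order.
printf 'one two\nthree\n' | wc -l -w -c
# Length of the longest line (here the 7-character first line).
printf 'one two\nthree\n' | wc -L
@end example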
1866
1867
1868 @node sum invocation
1869 @section @code{sum}: Print checksum and block counts
1870
1871 @pindex sum
1872 @cindex 16-bit checksum
1873 @cindex checksum, 16-bit
1874
1875 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1876 standard input if none are given or for a @var{file} of @samp{-}.  Synopsis:
1877
1878 @example
1879 sum [@var{option}]@dots{} [@var{file}]@dots{}
1880 @end example
1881
1882 @code{sum} prints the checksum for each @var{file} followed by the
1883 number of blocks in the file (rounded up).  If more than one @var{file}
1884 is given, file names are also printed (by default).  (With the
1885 @samp{--sysv} option, corresponding file names are printed when there is
1886 at least one file argument.)
1887
1888 By default, @sc{gnu} @code{sum} computes checksums using an algorithm
1889 compatible with BSD @code{sum} and prints file sizes in units of
1890 1024-byte blocks.
1891
1892 The program accepts the following options.  Also see @ref{Common options}.
1893
1894 @table @samp
1895
1896 @item -r
1897 @opindex -r
1898 @cindex BSD @code{sum}
1899 Use the default (BSD compatible) algorithm.  This option is included for
1900 compatibility with the System V @code{sum}.  Unless @samp{-s} was also
1901 given, it has no effect.
1902
1903 @item -s
1904 @itemx --sysv
1905 @opindex -s
1906 @opindex --sysv
1907 @cindex System V @code{sum}
1908 Compute checksums using an algorithm compatible with System V
1909 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1910
1911 @end table
1912
1913 @code{sum} is provided for compatibility; the @code{cksum} program (see
1914 next section) is preferable in new applications.
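
The difference in block size between the two modes can be seen with a
file whose length is not a multiple of either unit.  This is only a
sketch: @file{z.bin} is a placeholder name, and @samp{head -c} is
assumed to be available on the system.

@example
# Create a 1500-byte file of zero bytes.
head -c 1500 /dev/zero > z.bin
# Default (BSD) mode: 1500 bytes round up to 2 blocks of 1024 bytes.
sum z.bin
# System V mode: 1500 bytes round up to 3 blocks of 512 bytes.
sum -s z.bin
@end example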
1915
1916
1917 @node cksum invocation
1918 @section @code{cksum}: Print CRC checksum and byte counts
1919
1920 @pindex cksum
1921 @cindex cyclic redundancy check
1922 @cindex CRC checksum
1923
1924 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1925 given @var{file}, or standard input if none are given or for a
1926 @var{file} of @samp{-}.  Synopsis:
1927
1928 @example
1929 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1930 @end example
1931
1932 @code{cksum} prints the CRC checksum for each file along with the number
1933 of bytes in the file, and the filename unless no arguments were given.
1934
1935 @code{cksum} is typically used to ensure that files
1936 transferred by unreliable means (e.g., netnews) have not been corrupted,
1937 by comparing the @code{cksum} output for the received files with the
1938 @code{cksum} output for the original files (typically given in the
1939 distribution).
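
A comparison of this kind might look like the following sketch, in which
@file{orig.txt} and @file{copy.txt} are hypothetical stand-ins for the
original and the received file.

@example
printf 'hello\n' > orig.txt
cp orig.txt copy.txt
# Reading from standard input keeps the file name out of the output,
# so the two results can be compared as plain strings.
test "$(cksum < orig.txt)" = "$(cksum < copy.txt)" && echo unchanged
@end example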
1940
1941 The CRC algorithm is specified by the @sc{posix.2} standard.  It is not
1942 compatible with the BSD or System V @code{sum} algorithms (see the
1943 previous section); it is more robust.
1944
1945 The only options are @samp{--help} and @samp{--version}.  @xref{Common
1946 options}.
1947
1948
1949 @node md5sum invocation
1950 @section @code{md5sum}: Print or check message-digests
1951
1952 @pindex md5sum
1953 @cindex 128-bit checksum
1954 @cindex checksum, 128-bit
1955 @cindex fingerprint, 128-bit
1956 @cindex message-digest, 128-bit
1957
1958 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1959 @dfn{message-digest}) for each specified @var{file}.
If a @var{file} is specified as @samp{-}, or if no files are given,
@code{md5sum} computes the checksum for the standard input.
@code{md5sum} can also determine whether a file and checksum are
consistent.  Synopses:
1964
1965 @example
1966 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1967 md5sum [@var{option}]@dots{} --check [@var{file}]
1968 @end example
1969
For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
1971 indicating a binary or text input file, and the filename.
1972 If @var{file} is omitted or specified as @samp{-}, standard input is read.
1973
1974 The program accepts the following options.  Also see @ref{Common options}.
1975
1976 @table @samp
1977
1978 @item -b
1979 @itemx --binary
1980 @opindex -b
1981 @opindex --binary
1982 @cindex binary input files
1983 Treat all input files as binary.  This option has no effect on Unix
1984 systems, since they don't distinguish between binary and text files.
1985 This option is useful on systems that have different internal and
1986 external character representations.  On MS-DOS and MS-Windows, this is
1987 the default.
1988
1989 @item -c
@itemx --check
@opindex -c
@opindex --check
Read filenames and checksum information from the single @var{file}
(or from standard input if no @var{file} was specified) and report whether
1993 each named file and the corresponding checksum data are consistent.
1994 The input to this mode of @code{md5sum} is usually the output of
1995 a prior, checksum-generating run of @samp{md5sum}.
1996 Each valid line of input consists of an MD5 checksum, a binary/text
1997 flag, and then a filename.
1998 Binary files are marked with @samp{*}, text with @samp{ }.
1999 For each such line, @code{md5sum} reads the named file and computes its
2000 MD5 checksum.  Then, if the computed message digest does not match the
2001 one on the line with the filename, the file is noted as having
2002 failed the test.  Otherwise, the file passes the test.
2003 By default, for each valid line, one line is written to standard
2004 output indicating whether the named file passed the test.
2005 After all checks have been performed, if there were any failures,
2006 a warning is issued to standard error.
2007 Use the @samp{--status} option to inhibit that output.
2008 If any listed file cannot be opened or read, if any valid line has
2009 an MD5 checksum inconsistent with the associated file, or if no valid
2010 line is found, @code{md5sum} exits with nonzero status.  Otherwise,
2011 it exits successfully.
2012
@item --status
2014 @opindex --status
2015 @cindex verifying MD5 checksums
2016 This option is useful only when verifying checksums.
2017 When verifying checksums, don't generate the default one-line-per-file
2018 diagnostic and don't output the warning summarizing any failures.
2019 Failures to open or read a file still evoke individual diagnostics to
2020 standard error.
2021 If all listed files are readable and are consistent with the associated
2022 MD5 checksums, exit successfully.  Otherwise exit with a status code
2023 indicating there was a failure.
2024
2025 @item -t
2026 @itemx --text
2027 @opindex -t
2028 @opindex --text
2029 @cindex text input files
2030 Treat all input files as text files.  This is the reverse of
2031 @samp{--binary}.
2032
2033 @item -w
2034 @itemx --warn
2035 @opindex -w
2036 @opindex --warn
2037 @cindex verifying MD5 checksums
2038 When verifying checksums, warn about improperly formatted MD5 checksum lines.
2039 This option is useful only if all but a few lines in the checked input
2040 are valid.
2041
2042 @end table
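
The generate-then-verify cycle described under @samp{--check} could be
exercised as in the sketch below.  The file names @file{f.txt} and
@file{sums.md5} are placeholders for this example.

@example
printf 'data\n' > f.txt
# Generate a checksum list, then verify it.
md5sum f.txt > sums.md5
md5sum --check sums.md5
# With --status nothing is printed; only the exit status reports the result.
md5sum --check --status sums.md5 && echo verified
# After the file changes, verification fails with nonzero status.
printf 'changed\n' > f.txt
md5sum --check --status sums.md5 || echo mismatch
@end example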
2043
2044
2045 @node Operating on sorted files
2046 @chapter Operating on sorted files
2047
2048 @cindex operating on sorted files
2049 @cindex sorted files, operations on
2050
2051 These commands work with (or produce) sorted files.
2052
2053 @menu
2054 * sort invocation::             Sort text files.
2055 * uniq invocation::             Uniquify files.
2056 * comm invocation::             Compare two sorted files line by line.
2057 * ptx invocation::              Produce a permuted index of file contents.
2058 * tsort invocation::            Topological sort.
2059 @end menu
2060
2061
2062 @node sort invocation
2063 @section @code{sort}: Sort text files
2064
2065 @pindex sort
2066 @cindex sorting files
2067
2068 @code{sort} sorts, merges, or compares all the lines from the given
2069 files, or standard input if none are given or for a @var{file} of
2070 @samp{-}.  By default, @code{sort} writes the results to standard
2071 output.  Synopsis:
2072
2073 @example
2074 sort [@var{option}]@dots{} [@var{file}]@dots{}
2075 @end example
2076
2077 @code{sort} has three modes of operation: sort (the default), merge,
2078 and check for sortedness.  The following options change the operation
2079 mode:
2080
2081 @table @samp
2082
2083 @item -c
2084 @opindex -c
2085 @cindex checking for sortedness
2086 Check whether the given files are already sorted: if they are not all
2087 sorted, print an error message and exit with a status of 1.
2088 Otherwise, exit successfully.
2089
2090 @item -m
2091 @opindex -m
2092 @cindex merging sorted files
Merge the given files by sorting them as a group.  Each input file must
already be individually sorted.  Sorting always works in place of
merging; merging is provided because it is faster when the inputs are
already sorted.
2097
2098 @end table
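
The check and merge modes might be used together as in this sketch.  The
file names @file{s1} and @file{s2} are placeholders, and @samp{LC_ALL=C}
is set only to make the ordering independent of the user's locale.

@example
printf 'a\nc\n' > s1
printf 'b\nd\n' > s2
# Each file is already sorted, so both checks succeed silently.
LC_ALL=C sort -c s1 && LC_ALL=C sort -c s2
# The two files can then be merged without a full sort: a, b, c, d.
LC_ALL=C sort -m s1 s2
@end example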
2099
2100 @vindex LC_COLLATE
2101 A pair of lines is compared as follows: if any key fields have been
2102 specified, @code{sort} compares each pair of fields, in the order
2103 specified on the command line, according to the associated ordering
2104 options, until a difference is found or no fields are left.
2105 Unless otherwise specified, all comparisons use the character
2106 collating sequence specified by the @env{LC_COLLATE} locale.
2107
2108 If any of the global options @samp{Mbdfinr} are given but no key fields
2109 are specified, @code{sort} compares the entire lines according to the
2110 global options.
2111
2112 Finally, as a last resort when all keys compare equal (or if no
2113 ordering options were specified at all), @code{sort} compares the entire
2114 lines.  The last resort comparison
2115 honors the @samp{-r} global option.  The @samp{-s} (stable) option
2116 disables this last-resort comparison so that lines in which all fields
2117 compare equal are left in their original relative order.  If no fields
2118 or global options are specified, @samp{-s} has no effect.
2119
2120 @sc{gnu} @code{sort} (as specified for all @sc{gnu} utilities) has no limits on
2121 input line length or restrictions on bytes allowed within lines.  In
2122 addition, if the final byte of an input file is not a newline, @sc{gnu}
2123 @code{sort} silently supplies one.  A line's trailing newline is not
2124 part of the line for comparison purposes.@footnote{@sc{posix}.2-1992
2125 requires that the trailing newline be part of the comparison, and some
2126 @code{sort} implementations obey this requirement, but it is widely
2127 considered to be a bug in the standard and the next version of
2128 @sc{posix}.2 will likely remove this requirement.}
2129
2130 Upon any error, @code{sort} exits with a status of @samp{2}.
2131
2132 @vindex TMPDIR
2133 If the environment variable @env{TMPDIR} is set, @code{sort} uses its
2134 value as the directory for temporary files instead of @file{/tmp}.  The
2135 @samp{-T @var{tempdir}} option in turn overrides the environment
2136 variable.
2137
2138 The following options affect the ordering of output lines.  They may be
2139 specified globally or as part of a specific key field.  If no key
2140 fields are specified, global options apply to comparison of entire
2141 lines; otherwise the global options are inherited by key fields that do
2142 not specify any special options of their own.  In pre-@sc{posix}
2143 versions of @command{sort}, global options affect only later key fields,
2144 so portable shell scripts should specify global options first.
2145
2146 @table @samp
2147
2148 @item -b
2149 @opindex -b
2150 @cindex blanks, ignoring leading
2151 @vindex LC_CTYPE
2152 Ignore leading blanks when finding sort keys in each line.
2153 The @env{LC_CTYPE} locale determines character types.
2154
2155 @item -d
2156 @opindex -d
2157 @cindex phone directory order
2158 @cindex telephone directory order
2159 @vindex LC_CTYPE
2160 Sort in @dfn{phone directory} order: ignore all characters except
2161 letters, digits and blanks when sorting.
2162 The @env{LC_CTYPE} locale determines character types.
2163
2164 @item -f
2165 @opindex -f
2166 @cindex case folding
2167 @vindex LC_CTYPE
2168 Fold lowercase characters into the equivalent uppercase characters when
2169 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
2170 The @env{LC_CTYPE} locale determines character types.
2171
2172 @item -g
2173 @opindex -g
2174 @cindex general numeric sort
2175 @vindex LC_NUMERIC
2176 Sort numerically, using the standard C function @code{strtod} to convert
2177 a prefix of each line to a double-precision floating point number.
2178 This allows floating point numbers to be specified in scientific notation,
2179 like @code{1.0e-34} and @code{10e100}.
2180 The @env{LC_NUMERIC} locale determines the decimal-point character.
2181 Do not report overflow, underflow, or conversion errors.
2182 Use the following collating sequence:
2183
2184 @itemize @bullet
2185 @item
2186 Lines that do not start with numbers (all considered to be equal).
2187 @item
2188 NaNs (``Not a Number'' values, in IEEE floating point arithmetic)
2189 in a consistent but machine-dependent order.
2190 @item
2191 Minus infinity.
2192 @item
2193 Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal).
2194 @item
2195 Plus infinity.
2196 @end itemize
2197
2198 Use this option only if there is no alternative; it is much slower than
2199 @samp{-n} and it can lose information when converting to floating point.
2200
2201 @item -i
2202 @opindex -i
2203 @cindex unprintable characters, ignoring
2204 @vindex LC_CTYPE
2205 Ignore unprintable characters.
2206 The @env{LC_CTYPE} locale determines character types.
2207
2208 @item -M
2209 @opindex -M
2210 @cindex months, sorting by
2211 @vindex LC_TIME
2212 An initial string, consisting of any amount of whitespace, followed
2213 by a month name abbreviation, is folded to UPPER case and
2214 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
2215 Invalid names compare low to valid names.  The @env{LC_TIME} locale
2216 determines the month spellings.
2217
2218 @item -n
2219 @opindex -n
2220 @cindex numeric sort
2221 @vindex LC_NUMERIC
2222 Sort numerically: the number begins each line; specifically, it consists
2223 of optional whitespace, an optional @samp{-} sign, and zero or more
2224 digits possibly separated by thousands separators, optionally followed
2225 by a decimal-point character and zero or more digits.  The @env{LC_NUMERIC}
2226 locale specifies the decimal-point character and thousands separator.
2227
2228 @code{sort -n} uses what might be considered an unconventional method
2229 to compare strings representing floating point numbers.  Rather than
2230 first converting each string to the C @code{double} type and then
comparing those values, @code{sort} aligns the decimal-point characters in the two
2232 strings and compares the strings a character at a time.  One benefit
2233 of using this approach is its speed.  In practice this is much more
2234 efficient than performing the two corresponding string-to-double (or even
2235 string-to-integer) conversions and then comparing doubles.  In addition,
2236 there is no corresponding loss of precision.  Converting each string to
2237 @code{double} before comparison would limit precision to about 16 digits
2238 on most systems.
2239
2240 Neither a leading @samp{+} nor exponential notation is recognized.
2241 To compare such strings numerically, use the @samp{-g} option.
2242
2243 @item -r
2244 @opindex -r
2245 @cindex reverse sorting
2246 Reverse the result of comparison, so that lines with greater key values
2247 appear earlier in the output instead of later.
2248
2249 @end table
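
The difference between @samp{-n} and @samp{-g} shows up as soon as the
input contains exponential notation.  The following is a small sketch
with made-up input; @samp{LC_ALL=C} is set only to fix the locale.

@example
# -n reads only the leading digits of 1e3, so it sorts as 1: 1e3, 3, 20.
printf '1e3\n20\n3\n' | LC_ALL=C sort -n
# -g converts 1e3 to 1000, so it sorts last: 3, 20, 1e3.
printf '1e3\n20\n3\n' | LC_ALL=C sort -g
@end example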
2250
2251 Other options are:
2252
2253 @table @samp
2254
2255 @item -o @var{output-file}
2256 @opindex -o
2257 @cindex overwriting of input, allowed
2258 Write output to @var{output-file} instead of standard output.
2259 If @var{output-file} is one of the input files, @code{sort} copies
2260 it to a temporary file before sorting and writing the output to
2261 @var{output-file}.
2262
2263 @item -S @var{size}
2264 @opindex -S
2265 @cindex size for main memory sorting
2266 Use a main-memory sort buffer of the given @var{size}.  By default,
2267 @var{size} is in units of 1,024 bytes.  Appending @samp{%} causes
2268 @var{size} to be interpreted as a percentage of physical memory.
2269 Appending @samp{k} multiplies @var{size} by 1,024 (the default),
2270 @samp{M} by 1,048,576, @samp{G} by 1,073,741,824, and so on for
2271 @samp{T}, @samp{P}, @samp{E}, @samp{Z}, and @samp{Y}.  Appending
2272 @samp{b} causes @var{size} to be interpreted as a byte count, with no
2273 multiplication.
2274
2275 This option can improve the performance of @command{sort} by causing it
2276 to start with a larger or smaller sort buffer than the default.
2277 However, this option affects only the initial buffer size.  The buffer
2278 grows beyond @var{size} if @command{sort} encounters input lines larger
2279 than @var{size}.
2280
2281 @item -t @var{separator}
2282 @opindex -t
2283 @cindex field separator character
2284 Use character @var{separator} as the field separator when finding the
2285 sort keys in each line.  By default, fields are separated by the empty
2286 string between a non-whitespace character and a whitespace character.
2287 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
2288 into fields @w{@samp{ foo}} and @w{@samp{ bar}}.  The field separator is
2289 not considered to be part of either the field preceding or the field
2290 following.  But note that sort fields that extend to the end of the line,
2291 as @samp{-k 2}, or sort fields consisting of a range, as @samp{-k 2,3},
2292 retain the field separators present between the endpoints of the range.
2293
2294 @item -T @var{tempdir}
2295 @opindex -T
2296 @cindex temporary directory
2297 @vindex TMPDIR
2298 Use directory @var{tempdir} to store temporary files, overriding the
2299 @env{TMPDIR} environment variable.  If this option is given more than
2300 once, temporary files are stored in all the directories given.  If you
2301 have a large sort or merge that is I/O-bound, you can often improve
2302 performance by using this option to specify directories on different
2303 disks and controllers.
2304
2305 @item -u
2306 @opindex -u
2307 @cindex uniquifying output
2308 For the default case or the @samp{-m} option, only output the first
2309 of a sequence of lines that compare equal.  For the @samp{-c} option,
2310 check that no pair of consecutive lines compares equal.
2311
2312 @item -k @var{pos1}[,@var{pos2}]
2313 @opindex -k
2314 @cindex sort field
2315 The recommended, @sc{posix}, option for specifying a sort field.  The field
2316 consists of the part of the line between @var{pos1} and @var{pos2} (or the
2317 end of the line, if @var{pos2} is omitted), @emph{inclusive}.
2318 Fields and character positions are numbered starting with 1.
So to sort on the second field, you'd use @samp{-k 2,2}.
2320 See below for more examples.
2321
2322 @item -z
2323 @opindex -z
2324 @cindex sort zero-terminated lines
2325 Treat the input as a set of lines, each terminated by a zero byte (@sc{ascii}
2326 @sc{nul} (Null) character) instead of an @sc{ascii} @sc{lf} (Line Feed).
This option can be useful in conjunction with @samp{perl -0} or
@samp{find -print0} and @samp{xargs -0}, which do the same in order to
reliably handle arbitrary pathnames (even those that contain Line Feed
characters).
2331
2332 @item +@var{pos1}[-@var{pos2}]
The obsolete, traditional option for specifying a sort field.  The field
consists of the part of the line from @var{pos1} up to but @emph{not including}
2335 @var{pos2} (or the end of the line if @var{pos2} is omitted).  Fields
2336 and character positions are numbered starting with 0.  See below.
2337
2338 @end table
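
As a sketch of how @samp{-o} and @samp{-u} can work together, the
following sorts a file in place while dropping a duplicate line.  The
name @file{data.txt} is a placeholder.

@example
printf 'b 2\na 1\nb 2\n' > data.txt
# data.txt is both input and output; the duplicate line is removed.
LC_ALL=C sort -u -o data.txt data.txt
cat data.txt
@end example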
2339
2340 In addition, when @sc{gnu} @code{sort} is invoked with exactly one argument,
2341 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
2342 options}.
2343
2344 Historical (BSD and System V) implementations of @code{sort} have
2345 differed in their interpretation of some options, particularly
@samp{-b}, @samp{-f}, and @samp{-n}.  @sc{gnu} @code{sort} follows the @sc{posix}
2347 behavior, which is usually (but not always!) like the System V behavior.
2348 According to @sc{posix}, @samp{-n} no longer implies @samp{-b}.  For
2349 consistency, @samp{-M} has been changed in the same way.  This may
2350 affect the meaning of character positions in field specifications in
2351 obscure cases.  The only fix is to add an explicit @samp{-b}.
2352
2353 A position in a sort field specified with the @samp{-k} or @samp{+}
2354 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
2355 of the field to use and @var{c} is the number of the first character
2356 from the beginning of the field (for @samp{+@var{pos}}) or from the end
2357 of the previous field (for @samp{-@var{pos}}).  If the @samp{.@var{c}}
2358 is omitted, it is taken to be the first character in the field.  If the
2359 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
2360 specification is counted from the first nonblank character of the field
2361 (for @samp{+@var{pos}}) or from the first nonblank character following
2362 the previous field (for @samp{-@var{pos}}).
2363
2364 A sort key option may also have any of the option letters @samp{Mbdfinr}
2365 appended to it, in which case the global ordering options are not used
2366 for that particular field.  The @samp{-b} option may be independently
2367 attached to either or both of the @samp{+@var{pos}} and
2368 @samp{-@var{pos}} parts of a field specification, and if it is inherited
2369 from the global options it will be attached to both.
2370 Keys may span multiple fields.
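
A character-level position can be tried out with a sketch like the one
below, which uses made-up input and @samp{:} as the field separator.

@example
# Compare only characters 2 and 3 of the second field: "ab" sorts
# before "bc", so y:aab is printed before x:abc.
printf 'x:abc\ny:aab\n' | LC_ALL=C sort -t : -k 2.2,2.3
@end example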
2371
2372 Here are some examples to illustrate various combinations of options.
2373 In them, the @sc{posix} @samp{-k} option is used to specify sort keys rather
2374 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
2375
2376 @itemize @bullet
2377
2378 @item
2379 Sort in descending (reverse) numeric order.
2380
2381 @example
2382 sort -nr
2383 @end example
2384
2385 @item
2386 Sort alphabetically, omitting the first and second fields.
2387 This uses a single key composed of the characters beginning
2388 at the start of field three and extending to the end of each line.
2389
2390 @example
2391 sort -k 3
2392 @end example
2393
2394 @item
2395 Sort numerically on the second field and resolve ties by sorting
2396 alphabetically on the third and fourth characters of field five.
2397 Use @samp{:} as the field delimiter.
2398
2399 @example
2400 sort -t : -k 2,2n -k 5.3,5.4
2401 @end example
2402
Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
2404 @command{sort} would have used all characters beginning in the second field
2405 and extending to the end of the line as the primary @emph{numeric}
2406 key.  For the large majority of applications, treating keys spanning
2407 more than one field as numeric will not do what you expect.
2408
2409 Also note that the @samp{n} modifier was applied to the field-end
2410 specifier for the first key.  It would have been equivalent to
2411 specify @samp{-k 2n,2} or @samp{-k 2n,2n}.  All modifiers except
2412 @samp{b} apply to the associated @emph{field}, regardless of whether
2413 the modifier character is attached to the field-start and/or the
2414 field-end part of the key specifier.
2415
2416 @item
2417 Sort the password file on the fifth field and ignore any
2418 leading white space.  Sort lines with equal values in field five
2419 on the numeric user ID in field three.
2420
2421 @example
2422 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2423 @end example
2424
2425 An alternative is to use the global numeric modifier @samp{-n}.
2426
2427 @example
2428 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
2429 @end example
2430
Finally, to ignore both leading and trailing white space, you
could have applied the @samp{b} modifier to the field-end specifier
for the first key,

@example
sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
@end example

@noindent
or used the global @samp{-b} modifier instead of @samp{-n}
and an explicit @samp{n} with the second key specifier.

@example
sort -t : -b -k 5,5 -k 3,3n /etc/passwd
@end example

@item
Generate a tags file in case-insensitive sorted order.

@smallexample
find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
@end smallexample

The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
that pathnames that contain Line Feed characters will not get broken up
by the sort operation.
2456
2457 @c This example is a bit contrived and needs more explanation.
2458 @c @item
2459 @c Sort records separated by an arbitrary string by using a pipe to convert
2460 @c each record delimiter string to @samp{\0}, then using sort's -z option,
2461 @c and converting each @samp{\0} back to the original record delimiter.
2462 @c
2463 @c @example
2464 @c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 's/\0/\n/g'
2465 @c @end example
2466
2467 @end itemize
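
The key specifications used in the examples above can be tried on a
small sample.  The following is only a sketch: the input records and the
variable names are invented for illustration, and @env{LC_ALL} is set to
@samp{C} on the assumption that the text comparison should not depend on
the current locale.

```shell
# Hypothetical colon-separated records; field 2 is numeric, field 5 is text.
records='x:10:a:b:zzbc
x:2:a:b:zzab
x:10:a:b:zzaa'

# Primary key: field 2 compared numerically.
# Secondary key: characters 3 and 4 of field 5 compared as text.
by_keys=$(printf '%s\n' "$records" | LC_ALL=C sort -t : -k 2,2n -k 5.3,5.4)
printf '%s\n' "$by_keys"

# Made-up password-style records; field 5 carries leading blanks.
users='carol:x:30:30:  Smith
alice:x:20:20: Smith
bob:x:10:10:Jones'

# Ignore leading blanks in field 5, then break ties on the numeric user ID.
by_name=$(printf '%s\n' "$users" | LC_ALL=C sort -t : -k 5b,5 -k 3,3n)
printf '%s\n' "$by_name"
```

In the second command the two @samp{Smith} records compare equal on the
first key only because of the @samp{b} modifier, so the numeric user ID
decides their order.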
2468
2469
2470 @node uniq invocation
2471 @section @code{uniq}: Uniquify files
2472
2473 @pindex uniq
2474 @cindex uniquify files
2475
2476 @code{uniq} writes the unique lines in the given @var{input}, or
2477 standard input if nothing is given or for an @var{input} name of
2478 @samp{-}.  Synopsis:
2479
2480 @example
2481 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2482 @end example
2483
2484 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2485 discards all but one of identical successive lines.  Optionally, it can
2486 instead show only lines that appear exactly once, or lines that appear
2487 more than once.
2488
2489 Only adjacent lines are compared, so the input should be sorted.  If
2490 your input is not sorted, perhaps you want to use @code{sort -u}.
2491
2492 If no @var{output} file is specified, @code{uniq} writes to standard
2493 output.
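
Because @code{uniq} discards only identical @emph{successive} lines,
duplicates that are not adjacent survive.  The sketch below uses
made-up input and variable names to contrast @code{uniq} on unsorted
input with @code{sort | uniq} and @code{sort -u}.

```shell
# Sample input in which the two "b" lines are not adjacent.
input='b
a
b'

# uniq alone compares only neighboring lines, so nothing is removed here.
unsorted=$(printf '%s\n' "$input" | uniq)

# Sorting first brings the duplicates together.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)

# sort -u performs both steps at once.
sorted_u=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

printf 'uniq alone:\n%s\n' "$unsorted"
printf 'sort | uniq:\n%s\n' "$sorted"
printf 'sort -u:\n%s\n' "$sorted_u"
```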
2494
2495 The program accepts the following options.  Also see @ref{Common options}.
2496
2497 @table @samp
2498
2499 @item -@var{n}
2500 @itemx -f @var{n}
2501 @itemx --skip-fields=@var{n}
2502 @opindex -@var{n}
2503 @opindex -f
2504 @opindex --skip-fields
2505 Skip @var{n} fields on each line before checking for uniqueness.  Fields
2506 are sequences of non-space non-tab characters that are separated from
2507 each other by at least one space or tab.
2508
2509 @item +@var{n}
2510 @itemx -s @var{n}
2511 @itemx --skip-chars=@var{n}
2512 @opindex +@var{n}
2513 @opindex -s
2514 @opindex --skip-chars
2515 Skip @var{n} characters before checking for uniqueness.  If you use both
2516 the field and character skipping options, fields are skipped over first.
2517
2518 @item -c
2519 @itemx --count
2520 @opindex -c
2521 @opindex --count
2522 Print the number of times each line occurred along with the line.
2523
2524 @item -i
2525 @itemx --ignore-case
2526 @opindex -i
2527 @opindex --ignore-case
2528 Ignore differences in case when comparing lines.
2529
2530 @item -d
2531 @itemx --repeated
2532 @opindex -d
2533 @opindex --repeated
2534 @cindex duplicate lines, outputting
2535 Print only duplicate lines, one copy for each group of repeated lines.
2536
2537 @item -D
2538 @itemx --all-repeated
2539 @opindex -D
2540 @opindex --all-repeated
2541 @cindex all duplicate lines, outputting
2542 Print all duplicate lines and only duplicate lines.
2543 This option is useful mainly in conjunction with other options, e.g.,
2544 to ignore case or to compare only selected fields.
2545 This is a @sc{gnu} extension.
2546 @c FIXME: give an example showing *how* it's useful
2547
2548 @item -u
2549 @itemx --unique
2550 @opindex -u
2551 @opindex --unique
2552 @cindex unique lines, outputting
2553 Print only unique lines, i.e., lines that are not repeated.
2554
2555 @item -w @var{n}
2556 @itemx --check-chars=@var{n}
2557 @opindex -w
2558 @opindex --check-chars
2559 Compare @var{n} characters on each line (after skipping any specified
2560 fields and characters).  By default the entire rest of each line is
2561 compared.
2562
2563 @end table
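
The options in the table above can be compared side by side on a small
input.  This is a sketch only; the sample lines and variable names are
invented, and @samp{-D} and @samp{-w} are assumed to be available as
described above.

```shell
# Sample input, already grouped without regard to case.
lines='apple
Apple
banana
cherry
cherry'

# -d: one copy of each line that is repeated.
repeated=$(printf '%s\n' "$lines" | uniq -d)

# -u: only the lines that are not repeated.
singles=$(printf '%s\n' "$lines" | uniq -u)

# -i -D: every member of each group of lines that differ at most in case.
all_dups=$(printf '%s\n' "$lines" | uniq -i -D)

# -f 1: skip the first field, so only the rest of the line is compared.
by_field=$(printf '%s\n' '1 x' '2 x' '3 y' | uniq -f 1)

# -w 2: compare only the first two characters of each line.
by_prefix=$(printf '%s\n' 'aa1' 'aa2' 'ab1' | uniq -w 2)

printf '%s\n' "$repeated" "$singles" "$all_dups" "$by_field" "$by_prefix"
```

The @samp{-i -D} combination shows one case where @samp{-D} is helpful:
it lists the spellings that @samp{-i} treats as the same line.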
2564
2565
2566 @node comm invocation
2567 @section @code{comm}: Compare two sorted files line by line
2568
2569 @pindex comm
2570 @cindex line-by-line comparison
2571 @cindex comparing sorted files
2572
2573 @code{comm} writes to standard output lines that are common, and lines
2574 that are unique, to two input files; a file name of @samp{-} means
2575 standard input.  Synopsis:
2576
2577 @example
2578 comm [@var{option}]@dots{} @var{file1} @var{file2}
2579 @end example
2580
2581 @vindex LC_COLLATE
2582 Before @code{comm} can be used, the input files must be sorted using the
2583 collating sequence specified by the @env{LC_COLLATE} locale.
2584 If an input file ends in a non-newline
2585 character, a newline is silently appended.  The @code{sort} command with
2586 no options always outputs a file that is suitable input to @code{comm}.
2587
2588 @cindex differing lines
2589 @cindex common lines
2590 With no options, @code{comm} produces three-column output.  Column one
2591 contains lines unique to @var{file1}, column two contains lines unique
2592 to @var{file2}, and column three contains lines common to both files.
2593 Columns are separated by a single TAB character.
2594 @c FIXME: when there's an option to supply an alternative separator
2595 @c string, append `by default' to the above sentence.
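
As an illustration of the three-column layout, the following sketch
compares two small sorted files.  The scratch directory, the file names
and the variable name are arbitrary choices made for this example.

```shell
# Work in a scratch directory; the file names are arbitrary.
dir=$(mktemp -d)
printf '%s\n' a b c > "$dir/file1"
printf '%s\n' b c d > "$dir/file2"

# Column 1: only in file1; column 2: only in file2; column 3: in both.
# Each column after the first is indented by one TAB per preceding column.
three_cols=$(LC_ALL=C comm "$dir/file1" "$dir/file2")
printf '%s\n' "$three_cols"

rm -r "$dir"
```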
2596
2597 @opindex -1
2598 @opindex -2
2599 @opindex -3
2600 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2601 the corresponding columns.  Also see @ref{Common options}.
2602
2603 Unlike some other comparison utilities, @code{comm} has an exit
2604 status that does not depend on the result of the comparison.
2605 Upon normal completion @code{comm} produces an exit code of zero.
2606 If there is an error it exits with nonzero status.
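
The column options and the exit status can be observed with a short
experiment.  The files and variable names below are assumptions made for
the sake of the example.

```shell
dir=$(mktemp -d)
printf '%s\n' a b c > "$dir/file1"
printf '%s\n' b c d > "$dir/file2"

# -12 suppresses columns 1 and 2, leaving the lines common to both files.
common=$(LC_ALL=C comm -12 "$dir/file1" "$dir/file2")

# -23 leaves the lines found only in file1.
only_first=$(LC_ALL=C comm -23 "$dir/file1" "$dir/file2")

# The files differ, yet the exit status is still zero.
LC_ALL=C comm "$dir/file1" "$dir/file2" > /dev/null
comm_status=$?

rm -r "$dir"
printf 'common:\n%s\n' "$common"
printf 'only in file1:\n%s\n' "$only_first"
printf 'exit status: %s\n' "$comm_status"
```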
2607
2608
2609 @node tsort invocation
2610 @section @code{tsort}: Topological sort
2611
2612 @pindex tsort
2613 @cindex topological sort
2614
2615 @code{tsort} performs a topological sort on the given @var{file}, or
2616 standard input if no input file is given or for a @var{file} of
2617 @samp{-}.  Synopsis:
2618
2619 @example
2620 tsort [@var{option}] [@var{file}]
2621 @end example
2622
2623 @code{tsort} reads its input as pairs of strings, separated by blanks,
2624 indicating a partial ordering.  The output is a total ordering that
2625 corresponds to the given partial ordering.
2626
2627 For example
2628
2629 @example
2630 tsort <<EOF
2631 a b c
2632 d
2633 e f
2634 b c d e
2635 EOF
2636 @end example
2637
2638 @noindent
2639 will produce the output
2640
2641 @example
2642 a
2643 b
2644 c
2645 d
2646 e
2647 f
2648 @end example
2649
2650 @code{tsort} detects cycles in the input and writes the first cycle
2651 encountered to standard error.
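
A minimal sketch of the cycle report, using made-up input, captures only
what @code{tsort} writes to standard error.  The exact wording of the
diagnostic is not specified here and may differ between versions.

```shell
# Hypothetical input with a cycle: a before b, b before c, c before a.
# Standard error is captured; the ordering on standard output is discarded.
errors=$(printf '%s\n' 'a b' 'b c' 'c a' | tsort 2>&1 >/dev/null)

# Show the diagnostic; its wording may vary.
printf '%s\n' "$errors"
```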
2652
2653 Note that for a given partial ordering, generally there is no unique
2654 total ordering.
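
Since several total orderings may satisfy the same input, a script
should check the constraints rather than expect one particular output.
The following sketch does this for an invented set of pairs; the helper
@code{position} and all variable names are hypothetical.

```shell
# Hypothetical dependency pairs: the first item must precede the second.
pairs='socks shoes
trousers shoes
shirt tie
tie jacket
trousers belt'

order=$(printf '%s\n' "$pairs" | tsort)
printf '%s\n' "$order"

# Position of an item in the computed order (1-based line number).
position() (
  printf '%s\n' "$order" | grep -n -x -- "$1" | cut -d : -f 1
)

# Check every pair against the order instead of expecting one fixed answer.
violations=0
while read -r first second; do
  if [ "$(position "$first")" -ge "$(position "$second")" ]; then
    violations=$((violations + 1))
  fi
done <<EOF
$pairs
EOF
printf 'violations: %s\n' "$violations"
```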
2655
2656 The only options are @samp{--help} and @samp{--version}.  @xref{Common
2657 options}.
2658
2659
2660 @node ptx invocation
2661 @section @code{ptx}: Produce permuted indexes
2662
2663 @pindex ptx
2664
2665 @code{ptx} reads a text file and essentially produces a permuted index, with
2666 each keyword in its context.  The calling sketch is one of:
2667
2668 @example
2669 ptx [@var{option} @dots{}] [@var{file} @dots{}]
2670 ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
2671 @end example
2672
2673 The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
2674 all @sc{gnu} extensions and reverts to traditional mode, thus introducing some
2675 limitations and changing several of the program's default option values.
2676 When @samp{-G} is not specified, @sc{gnu} extensions are always enabled.
2677 @sc{gnu} extensions to @code{ptx} are documented wherever appropriate in this
2678 document.  For the full list, see @ref{Compatibility in ptx}.
2679
2680 Individual options are explained in the following sections.
2681
2682 When @sc{gnu} extensions are enabled, there may be zero, one or several
2683 @var{file}s after the options.  If there is no @var{file}, the program
2684 reads the standard input.  If there are one or several @var{file}s, they
2685 give the names of input files which are all read in turn, as if all the
2686 input files were concatenated.  However, there is a full contextual
2687 break between each file and, when automatic referencing is requested,
2688 file names and line numbers refer to individual text input files.  In
2689 all cases, the program outputs the permuted index to the standard
2690 output.
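
The following sketch shows both ways of supplying input when @sc{gnu}
extensions are enabled.  The file names and sample sentences are made
up, and @samp{-i /dev/null} is added on the assumption that no Ignore
file should remove keywords from this small sample.

```shell
# Scratch directory with two small, made-up input files.
dir=$(mktemp -d)
printf 'The quick brown fox\n' > "$dir/one.txt"
printf 'A lazy dog sleeps\n' > "$dir/two.txt"

# Both files are indexed in a single run; the index goes to standard output.
index=$(ptx -i /dev/null "$dir/one.txt" "$dir/two.txt")
printf '%s\n' "$index"

# No input file: the text is read from standard input instead.
from_stdin=$(printf 'The quick brown fox\n' | ptx -i /dev/null)
printf '%s\n' "$from_stdin"

rm -r "$dir"
```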
2691
2692 When @sc{gnu} extensions are @emph{not} enabled, that is, when the program
2693 operates in traditional mode, there may be zero, one or two parameters
2694 besides the options.  If there are no parameters, the program reads the
2695 standard input and outputs the permuted index to the standard output.
2696 If there is only one parameter, it names the text @var{input} to be read
2697 instead of the standard input.  If two parameters are given, they give
2698 respectively the name of the @var{input} file to read and the name of
2699 the @var{output} file to produce.  @emph{Be very careful} to note that,
2700 in this case, the contents of the file given by the second parameter are
2701 destroyed.  This behavior is dictated by System V @code{ptx}
2702 compatibility; @sc{gnu} Standards normally discourage output parameters not
2703 introduced by an option.
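
The effect of the second parameter in traditional mode can be checked
safely in a scratch directory.  This is a sketch with invented file
names and contents; it assumes a @code{ptx} that accepts @samp{-G}.

```shell
dir=$(mktemp -d)
printf 'alpha beta\n' > "$dir/in.txt"
printf 'previous contents\n' > "$dir/out.txt"

# Traditional mode: the second parameter names the output file, and
# whatever it held before is overwritten.
ptx -G "$dir/in.txt" "$dir/out.txt"

result=$(cat "$dir/out.txt")
printf '%s\n' "$result"
rm -r "$dir"
```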
2704
2705 Note that for @emph{any} file named as the value of an option or as an
2706 input text file, a single dash @kbd{-} may be used, in which case
2707 standard input is assumed.  However, it would not make sense to use this
2708 convention more than once per program invocation.
2709
2710 @menu
2711 * General options in ptx::      Options which affect general program behavior.
2712 * Charset selection in ptx::    Underlying character set considerations.
2713 * Input processing in ptx::     Input fields, contexts, and keyword selection.
2714 * Output formatting in ptx::    Types of output format, and sizing the fields.
2715 * Compatibility in ptx::
2716 @end menu
2717
2718
2719 @node General options in ptx
2720 @subsection General options
2721
2722 @table @samp
2723
2724 @item -C
2725 @itemx --copyright
2726 Print a short note about the copyright and copying conditions, then
2727 exit without further processing.
2728
2729 @item -G
2730 @itemx --traditional
2731 As already explained, this option disables all @sc{gnu} extensions to
2732 @code{ptx} and switches to traditional mode.
2733
2734 @item --help
2735 Print a short help on standard output, then exit without further
2736 processing.
2737
2738 @item --version
2739 Print the program version on standard output, then exit without further
2740 processing.
2741
2742 @end table
2743
2744
2745 @node Charset selection in ptx
2746 @subsection Charset selection
2747
2748 @c FIXME:  People don't necessarily know what an IBM-PC was these days.
2749 As it is set up now, the program assumes that the input file is coded
2750 using 8-bit ISO 8859-1 code, also known as the Latin-1 character set,
2751 @emph{unless} it is compiled for MS-DOS, in which case it uses the
2752 character set of the IBM-PC.  (@sc{gnu} @code{ptx} is not known to work on
2753 smaller MS-DOS machines anymore.)  Compared to 7-bit @sc{ascii}, the set
2754 of characters which are letters is different; this alters the behavior
2755 of regular expression matching.  Thus, the default regular expression
2756 for a keyword allows foreign or diacriticized letters.  Keyword sorting,
2757 however, is still crude; it obeys the underlying character set ordering
2758 quite blindly.
2759
2760 @table @samp
2761
2762 @item -f
2763 @itemx --ignore-case
2764 Fold lower case letters to upper case for sorting.
2765
2766 @end table
2767
2768
2769 @node Input processing in ptx
2770 @subsection Word selection and input processing
2771
2772 @table @samp
2773
2774 @item -b @var{file}
2775 @itemx --break-file=@var{file}
2776
2777 This option provides an alternative (to @samp{-W}) method of describing
2778 which characters make up words.  It introduces the name of a
2779 file which contains a list of characters which can@emph{not} be part of
2780 one word; this file is called the @dfn{Break file}.  Any character which
2781 is not part of the Break file is a word constituent.  If both options
2782 @samp{-b} and @samp{-W} are specified, then @samp{-W} has precedence and
2783 @samp{-b} is ignored.
2784
2785 When @sc{gnu} extensions are enabled, the only way to avoid newline as a
2786 break character is to write all the break characters in the file with no
2787 newline at all, not even at the end of the file.  When @sc{gnu} extensions
2788 are disabled, spaces, tabs and newlines are always considered as break
2789 characters even if not included in the Break file.
2790
2791 @item -i @var{file}
2792 @itemx --ignore-file=@var{file}
2793
2794 The file associated with this option contains a list of words which will
2795 never be taken as keywords in concordance output.  It is called the
2796 @dfn{Ignore file}.  The file contains exactly one word in each line; the
2797 end of line separation of words is not subject to the value of the
2798 @samp{-S} option.
2799
2800 There is a default Ignore file used by @code{ptx} when this option is
2801 not specified, usually found in @file{/usr/local/lib/eign} if this has
2802 not been changed at installation time.  If you want to deactivate the
2803 default Ignore file, specify @code{/dev/null} instead.
2804
2805 @item -o @var{file}
2806 @itemx --only-file=@var{file}
2807
2808 The file associated with this option contains a list of words which will
2809 be retained in concordance output; any word not mentioned in this file
2810 is ignored.  The file is called the @dfn{Only file}.  The file contains
2811 exactly one word in each line; the end of line separation of words is
2812 not subject to the value of the @samp{-S} option.
2813
2814 There is no default for the Only file.  When both an Only file and an
2815 Ignore file are specified, a word is considered a keyword only
2816 if it is listed in the Only file and not in the Ignore file.
2817
2818 @item -r
2819 @itemx --references
2820
2821 On each input line, the leading sequence of non-white space characters will be
2822 taken to be a reference that has the purpose of identifying this input
2823 line in the resulting permuted index.  For more information about reference
2824 production, see @ref{Output formatting in ptx}.
2825 Using this option changes the default value for option @samp{-S}.
2826
2827 Using this option, the program does not try very hard to remove
2828 references from contexts in output, but it succeeds in doing so
2829 @emph{when} the context ends exactly at the newline.  If option
2830 @samp{-r} is used with @samp{-S} default value, or when @sc{gnu} extensions
2831 are disabled, this condition is always met and references are completely
2832 excluded from the output contexts.
2833
2834 @item -S @var{regexp}
2835 @itemx --sentence-regexp=@var{regexp}
2836
2837 This option selects which regular expression will describe the end of a
2838 line or the end of a sentence.  In fact, this regular expression is not
2839 the only distinction between end of lines or end of sentences, and input
2840 line boundaries have no special significance outside this option.  By
2841 default, when @sc{gnu} extensions are enabled and if @samp{-r} option is not
2842 used, end of sentences are used.  In this case, this @var{regexp} is
2843 imported from @sc{gnu} Emacs:
2844
2845 @example
2846 [.?!][]\"')@}]*\\($\\|\t\\|  \\)[ \t\n]*
2847 @end example
2848
2849 Whenever @sc{gnu} extensions are disabled or if @samp{-r} option is used, end
2850 of lines are used; in this case, the default @var{regexp} is just:
2851
2852 @example
2853 \n
2854 @end example
2855
2856 Using an empty @var{regexp} is equivalent to completely disabling end of
2857 line or end of sentence recognition.  In this case, the whole file is
2858 considered to be a single big line or sentence.  The user might want to
2859 disallow all truncation flag generation as well, through option @samp{-F
2860 ""}.  @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2861 Manual}.
2862
2863 When the keywords happen to be near the beginning of the input line or
2864 sentence, this often creates an unused area at the beginning of the
2865 output context line; when the keywords happen to be near the end of the
2866 input line or sentence, this often creates an unused area at the end of
2867 the output context line.  The program tries to fill those unused areas
2868 by wrapping around context in them; the tail of the input line or
2869 sentence is used to fill the unused area on the left of the output line;
2870 the head of the input line or sentence is used to fill the unused area
2871 on the right of the output line.
2872
2873 As a matter of convenience to the user, many usual backslashed escape
2874 sequences from the C language are recognized and converted to the
2875 corresponding characters by @code{ptx} itself.
2876
2877 @item -W @var{regexp}
2878 @itemx --word-regexp=@var{regexp}
2879
2880 This option selects which regular expression will describe each keyword.
2881 By default, if @sc{gnu} extensions are enabled, a word is a sequence of
2882 letters; the @var{regexp} used is @samp{\w+}.  When @sc{gnu} extensions are
2883 disabled, a word is by default anything which ends with a space, a tab
2884 or a newline; the @var{regexp} used is @samp{[^ \t\n]+}.
2885
2886 An empty @var{regexp} is equivalent to not using this option.
2887 @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2888 Manual}.
2889
2890 As a matter of convenience to the user, many usual backslashed escape
2891 sequences, as found in the C language, are recognized and converted to
2892 the corresponding characters by @code{ptx} itself.
2893
2894 @end table
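
As a sketch of keyword selection, the example below builds a one-word
Only file and a one-word Ignore file and counts the resulting index
lines.  The file names, the sample sentence and the variable names are
assumptions made for illustration.

```shell
dir=$(mktemp -d)
printf 'fox\n' > "$dir/only"      # Only file: one word per line
printf 'The\n' > "$dir/ignore"    # Ignore file: one word per line
text='The quick brown fox'

# With the Only file, "fox" is the single word that may become a keyword.
only_out=$(printf '%s\n' "$text" | ptx -o "$dir/only")

# With the Ignore file, every word except "The" may become a keyword.
ignore_out=$(printf '%s\n' "$text" | ptx -i "$dir/ignore")

rm -r "$dir"
printf 'Only file:\n%s\n' "$only_out"
printf 'Ignore file:\n%s\n' "$ignore_out"
```

Each keyword occurrence produces one output line, so the first command
is expected to print a single line and the second one line for each of
the three remaining words.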
2895
2896
2897 @node Output formatting in ptx
2898 @subsection Output formatting
2899
2900 Output format is mainly controlled by the @samp{-O} and @samp{-T} options
2901 described in the table below.  When neither @samp{-O} nor @samp{-T} is
2902 selected, and if @sc{gnu} extensions are enabled, the program chooses an
2903 output format suitable for a dumb terminal.  Each keyword occurrence is
2904 output to the center of one line, surrounded by its left and right
2905 contexts.  Each field is properly justified, so the concordance output
2906 can be readily observed.  As a special feature, if automatic
2907 references are selected by option @samp{-A} and are output before the
2908 left context, that is, if option @samp{-R} is @emph{not} selected, then
2909 a colon is added after the reference; this nicely interfaces with @sc{gnu}
2910 Emacs @code{next-error} processing.  In this default output format, each
2911 white space character, like newline and tab, is merely changed to
2912 exactly one space, with no special attempt to compress consecutive
2913 spaces.  This might change in the future.  Except for those white space
2914 characters, every other character of the underlying set of 256
2915 characters is transmitted verbatim.
2916
2917 Output format is further controlled by the following options.
2918
2919 @table @samp
2920
2921 @item -g @var{number}
2922 @itemx --gap-size=@var{number}
2923
2924 Select the size of the minimum white space gap between the fields on the
2925 output line.
2926
2927 @item -w @var{number}
2928 @itemx --width=@var{number}
2929
2930 Select the maximum output width of each final line.  If references are
2931 used, they are included or excluded from the maximum output width
2932 depending on the value of option @samp{-R}.  If this option is not
2933 selected, that is, when references are output before the left context,
2934 the maximum output width takes into account the maximum length of all
2935 references.  If this option is selected, that is, when references are
2936 output after the right context, the maximum output width does not take
2937 into account the space taken by references, nor the gap that precedes
2938 them.
2939
2940 @item -A
2941 @itemx --auto-reference
2942
2943 Select automatic references.  Each input line will have an automatic
2944 reference made up of the file name and the line ordinal, with a single
2945 colon between them.  However, the file name will be empty when standard
2946 input is being read.  If both @samp{-A} and @samp{-r} are selected, then
2947 the input reference is still read and skipped, but the automatic
2948 reference is used at output time, overriding the input reference.
2949
2950 @item -R
2951 @itemx --right-side-refs
2952
2953 In the default output format, when option @samp{-R} is not used, any
2954 references produced by the effect of options @samp{-r} or @samp{-A} are
2955 placed to the far left of output lines, before the left context.  With
2956 default output format, when the @samp{-R} option is specified, references
2957 are rather placed at the end of each output line, after the right
2958 context.  For any other output format, option @samp{-R} is
2959 ignored, with one exception:  with @samp{-R} the width of references
2960 is @emph{not} taken into account in total output width given by @samp{-w}.
2961
2962 This option is automatically selected whenever @sc{gnu} extensions are
2963 disabled.
2964
2965 @item -F @var{string}
2966 @itemx --flag-truncation=@var{string}
2967
2968 This option will request that any truncation in the output be reported
2969 using the string @var{string}.  Most output fields theoretically extend
2970 towards the beginning or the end of the current line, or current
2971 sentence, as selected with option @samp{-S}.  But there is a maximum
2972 allowed output line width, changeable through option @samp{-w}, which is
2973 further divided into space for various output fields.  When a field
2974 cannot extend all the way to the beginning or the end of the current
2975 line because it has to fit in its allotted space, a truncation occurs.  By default,
2976 the string used is a single slash, as in @samp{-F /}.
2977
2978 @var{string} may have more than one character, as in @samp{-F ...}.
2979 Also, in the particular case when @var{string} is empty (@samp{-F ""}),
2980 truncation flagging is disabled, and no truncation marks are appended in
2981 this case.
2982
2983 As a matter of convenience to the user, many usual backslashed escape
2984 sequences, as found in the C language, are recognized and converted to
2985 the corresponding characters by @code{ptx} itself.
2986
2987 @item -M @var{string}
2988 @itemx --macro-name=@var{string}
2989
2990 Select another @var{string} to be used instead of @samp{xx}, while
2991 generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
2992
2993 @item -O
2994 @itemx --format=roff
2995
2996 Choose an output format suitable for @code{nroff} or @code{troff}
2997 processing.  Each output line will look like:
2998
2999 @smallexample
3000 .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
3001 @end smallexample
3002
3003 so it will be possible to write a @samp{.xx} roff macro to take care of
3004 the output typesetting.  This is the default output format when @sc{gnu}
3005 extensions are disabled.  Option @samp{-M} can be used to change
3006 @samp{xx} to another macro name.
3007
3008 In this output format, each non-graphical character, like newline and
3009 tab, is merely changed to exactly one space, with no special attempt to
3010 compress consecutive spaces.  Each quote character: @kbd{"} is doubled
3011 so it will be correctly processed by @code{nroff} or @code{troff}.
3012
3013 @item -T
3014 @itemx --format=tex
3015
3016 Choose an output format suitable for @TeX{} processing.  Each output
3017 line will look like:
3018
3019 @smallexample
3020 \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
3021 @end smallexample
3022
3023 @noindent
3024 so it will be possible to write a @code{\xx} definition to take care of
3025 the output typesetting.  Note that when references are not being
3026 produced, that is, neither option @samp{-A} nor option @samp{-r} is
3027 selected, the last parameter of each @code{\xx} call is omitted.
3028 Option @samp{-M} can be used to change @samp{xx} to another macro
3029 name.
3030
3031 In this output format, some special characters, like @kbd{$}, @kbd{%},
3032 @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
3033 backslash.  Curly brackets @kbd{@{}, @kbd{@}} are protected with a
3034 backslash and a pair of dollar signs (to force mathematical mode).  The
3035 backslash itself produces the sequence @code{\backslash@{@}}.
3036 Circumflex and tilde diacritics produce the sequence @code{^\@{ @}} and
3037 @code{~\@{ @}} respectively.  Other diacriticized characters of the
3038 underlying character set produce an appropriate @TeX{} sequence as far
3039 as possible.  The other non-graphical characters, like newline and tab,
3040 and all other characters which are not part of @sc{ascii}, are merely
3041 changed to exactly one space, with no special attempt to compress
3042 consecutive spaces.  Let me know how to improve this special character
3043 processing for @TeX{}.
3044
3045 @end table
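
The macro-based output formats can be inspected directly.  The sketch
below uses a made-up sentence and the hypothetical macro name @samp{kw};
@samp{-i /dev/null} is added on the assumption that no Ignore file
should apply.

```shell
text='The quick brown fox'

# roff format: each line is a call to the .xx macro.
roff=$(printf '%s\n' "$text" | ptx -i /dev/null -O)

# Same format with a different macro name (hypothetical name "kw").
roff_kw=$(printf '%s\n' "$text" | ptx -i /dev/null -O -M kw)

# TeX format: each line is a call to \xx.
tex=$(printf '%s\n' "$text" | ptx -i /dev/null -T)

printf '%s\n' "$roff" "$roff_kw" "$tex"
```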
3046
3047
3048 @node Compatibility in ptx
3049 @subsection The @sc{gnu} extensions to @code{ptx}
3050
3051 This version of @code{ptx} contains a few features which do not exist in
3052 System V @code{ptx}.  These extra features are suppressed by using the
3053 @samp{-G} command line option, unless overridden by other command line
3054 options.  Some @sc{gnu} extensions cannot be recovered by overriding, so the
3055 simple rule is to avoid @samp{-G} if you care about @sc{gnu} extensions.
3056 Here are the differences between this program and System V @code{ptx}.
3057
3058 @itemize @bullet
3059
3060 @item
3061 This program can read many input files at once; it always writes the
3062 resulting concordance on standard output.  On the other hand, System V
3063 @code{ptx} reads only one file and sends the result to standard output
3064 or, if a second @var{file} parameter is given on the command line, to that
3065 @var{file}.
3066
3067 Having output parameters not introduced by options is a dangerous
3068 practice which @sc{gnu} avoids as far as possible.  So, for using @code{ptx}
3069 portably between @sc{gnu} and System V, you should always use it with a
3070 single input file, and always expect the result on standard output.  You
3071 might also want to automatically configure in a @samp{-G} option to
3072 @code{ptx} calls in products using @code{ptx}, if the configurator finds
3073 that the installed @code{ptx} accepts @samp{-G}.
3074
3075 @item
3076 The only options available in System V @code{ptx} are options @samp{-b},
3077 @samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
3078 @samp{-w}.  All other options are @sc{gnu} extensions and are not repeated in
3079 this enumeration.  Moreover, some options have a slightly different
3080 meaning when @sc{gnu} extensions are enabled, as explained below.
3081
3082 @item
3083 By default, concordance output is not formatted for @code{troff} or
3084 @code{nroff}.  It is rather formatted for a dumb terminal.  @code{troff}
3085 or @code{nroff} output may still be selected through option @samp{-O}.
3086
3087 @item
3088 Unless @samp{-R} option is used, the maximum reference width is
3089 subtracted from the total output line width.  With @sc{gnu} extensions
3090 disabled, width of references is not taken into account in the output
3091 line width computations.
3092
3093 @item
3094 All 256 characters, even @kbd{NUL}s, are always read and processed from
3095 input file with no adverse effect, even if @sc{gnu} extensions are disabled.
3096 However, System V @code{ptx} does not accept 8-bit characters, a few
3097 control characters are rejected, and the tilde @kbd{~} is also rejected.
3098
3099 @item
3100 Input line length is only limited by available memory, even if @sc{gnu}
3101 extensions are disabled.  However, System V @code{ptx} processes only
3102 the first 200 characters in each line.
3103
3104 @item
3105 The break (non-word) characters default to every character except all
3106 letters of the underlying character set, diacriticized or not.  When @sc{gnu}
3107 extensions are disabled, the break characters default to space, tab and
3108 newline only.
3109
3110 @item
3111 The program makes better use of output line width.  If @sc{gnu} extensions
3112 are disabled, the program rather tries to imitate System V @code{ptx},
3113 but still, there are some slight disposition glitches this program does
3114 not completely reproduce.
3115
3116 @item
3117 The user can specify both an Ignore file and an Only file.  This is not
3118 allowed with System V @code{ptx}.
3119
3120 @end itemize
3121
3122
3123 @node Operating on fields within a line
3124 @chapter Operating on fields within a line
3125
3126 @menu
3127 * cut invocation::              Print selected parts of lines.
3128 * paste invocation::            Merge lines of files.
3129 * join invocation::             Join lines on a common field.
3130 @end menu
3131
3132
3133 @node cut invocation
3134 @section @code{cut}: Print selected parts of lines
3135
3136 @pindex cut
3137 @code{cut} writes to standard output selected parts of each line of each
3138 input file, or standard input if no files are given or for a file name of
3139 @samp{-}.  Synopsis:
3140
3141 @example
3142 cut [@var{option}]@dots{} [@var{file}]@dots{}
3143 @end example
3144
3145 In the table which follows, the @var{byte-list}, @var{character-list},
3146 and @var{field-list} are one or more numbers or ranges (two numbers
3147 separated by a dash) separated by commas.  Bytes, characters, and
3148 fields are numbered starting at 1.  Incomplete ranges may be
3149 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
3150 @samp{@var{n}} through end of line or last field.
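
The list syntax can be tried on a single line of text.  In this sketch
the sample line and the variable names are arbitrary.

```shell
line='abcdef'

middle=$(printf '%s\n' "$line" | cut -c 2-4)      # characters 2 through 4
head3=$(printf '%s\n' "$line" | cut -c -3)        # same as 1-3
tail4=$(printf '%s\n' "$line" | cut -c 4-)        # 4 through end of line
picked=$(printf '%s\n' "$line" | cut -c 1,3,5-6)  # numbers and a range

printf '%s\n' "$middle" "$head3" "$tail4" "$picked"
```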
3151
3152 The program accepts the following options.  Also see @ref{Common
3153 options}.
3154
3155 @table @samp
3156
3157 @item -b @var{byte-list}
3158 @itemx --bytes=@var{byte-list}
3159 @opindex -b
3160 @opindex --bytes
3161 Print only the bytes in positions listed in @var{byte-list}.  Tabs and
3162 backspaces are treated like any other character; they take up 1 byte.
3163
3164 @item -c @var{character-list}
3165 @itemx --characters=@var{character-list}
3166 @opindex -c
3167 @opindex --characters
3168 Print only characters in positions listed in @var{character-list}.
3169 The same as @samp{-b} for now, but internationalization will change
3170 that.  Tabs and backspaces are treated like any other character; they
3171 take up 1 character.
3172
3173 @item -f @var{field-list}
3174 @itemx --fields=@var{field-list}
3175 @opindex -f
3176 @opindex --fields
3177 Print only the fields listed in @var{field-list}.  Fields are
3178 separated by a TAB character by default.
3179 Also print any line that contains no delimiter character, unless
3180 the @samp{--only-delimited} (@samp{-s}) option is specified.
3181
3182 @item -d @var{input_delim_byte}
3183 @itemx --delimiter=@var{input_delim_byte}
3184 @opindex -d
3185 @opindex --delimiter
3186 For @samp{-f}, fields are separated in the input by the first character
3187 in @var{input_delim_byte} (default is TAB).
3188
3189 @item -n
3190 @opindex -n
3191 Do not split multi-byte characters (no-op for now).
3192
3193 @item -s
3194 @itemx --only-delimited
3195 @opindex -s
3196 @opindex --only-delimited
3197 For @samp{-f}, do not print lines that do not contain the field separator
3198 character.
3199
3200 @item --output-delimiter=@var{output_delim_string}
3201 @opindex --output-delimiter
3202 For @samp{-f}, output fields are separated by @var{output_delim_string}.
3203 The default is to use the input delimiter.
3204
3205
3206 @end table
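
The field options interact as sketched below.  The input is invented:
one colon-delimited record and one line that contains no delimiter.

```shell
# Sample input: one delimited record and one line without any delimiter.
input='root:x:0:0
no delimiter here'

# Fields 1 and 3; the line without a delimiter is passed through unchanged.
fields=$(printf '%s\n' "$input" | cut -d : -f 1,3)

# -s drops the line that has no delimiter.
delimited=$(printf '%s\n' "$input" | cut -d : -f 1,3 -s)

# A different separator on output only.
spaced=$(printf '%s\n' "$input" | cut -d : -f 1,3 --output-delimiter=' ')

printf '%s\n' "$fields" "$delimited" "$spaced"
```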
3207
3208
3209 @node paste invocation
3210 @section @code{paste}: Merge lines of files
3211
3212 @pindex paste
3213 @cindex merging files
3214
3215 @code{paste} writes to standard output lines consisting of sequentially
3216 corresponding lines of each given file, separated by a TAB character.
3217 Standard input is used for a file name of @samp{-} or if no input files
3218 are given.
3219
3220 Synopsis:
3221
3222 @example
3223 paste [@var{option}]@dots{} [@var{file}]@dots{}
3224 @end example
3225
3226 The program accepts the following options.  Also see @ref{Common options}.
3227
3228 @table @samp
3229
3230 @item -s
3231 @itemx --serial
3232 @opindex -s
3233 @opindex --serial
3234 Paste the lines of one file at a time rather than one line from each
3235 file.
3236
3237 @item -d @var{delim-list}
3238 @itemx --delimiters=@var{delim-list}
3239 @opindex -d
3240 @opindex --delimiters
3241 Consecutively use the characters in @var{delim-list} instead of
3242 TAB to separate merged lines.  When @var{delim-list} is
3243 exhausted, start again at its beginning.
3244
3245 @end table
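
The two modes can be compared with a short sketch.  The four input
lines are sample data, and @samp{-} is given explicitly so that each
command reads standard input:

@example
# Parallel mode: standard input is named twice, so successive lines
# are paired up and separated by a TAB.
printf '1\n2\n3\n4\n' | paste - -

# Serial mode with a delimiter list: ',' and ';' are used in turn,
# starting over at ',' when the list is exhausted.
printf '1\n2\n3\n4\n' | paste -s -d ',;' -
@end example

@noindent
The first command should print @samp{1}, a TAB, and @samp{2} on one line
and @samp{3}, a TAB, and @samp{4} on the next; the second should print
the single line @samp{1,2;3,4}.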
3246
3247
3248 @node join invocation
3249 @section @code{join}: Join lines on a common field
3250
3251 @pindex join
3252 @cindex common field, joining on
3253
3254 @code{join} writes to standard output a line for each pair of input
3255 lines that have identical join fields.  Synopsis:
3256
3257 @example
3258 join [@var{option}]@dots{} @var{file1} @var{file2}
3259 @end example
3260
3261 @vindex LC_COLLATE
3262 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
3263 meaning standard input.  @var{file1} and @var{file2} should be already
3264 sorted in increasing textual order on the join fields, using the
3265 collating sequence specified by the @env{LC_COLLATE} locale.  Unless
3266 the @samp{-t} option is given, the input should be sorted ignoring blanks at
3267 the start of the join field, as in @code{sort -b}.  If the
3268 @samp{--ignore-case} option is given, lines should be sorted without
3269 regard to the case of characters in the join field, as in @code{sort -f}.
3270
3271 The defaults are: the join field is the first field in each line;
3272 fields in the input are separated by one or more blanks, with leading
3273 blanks on the line ignored; fields in the output are separated by a
3274 space; each output line consists of the join field, the remaining
3275 fields from @var{file1}, then the remaining fields from @var{file2}.
3276
3277 The program accepts the following options.  Also see @ref{Common options}.
3278
3279 @table @samp
3280
3281 @item -a @var{file-number}
3282 @opindex -a
3283 Print a line for each unpairable line in file @var{file-number} (either
3284 @samp{1} or @samp{2}), in addition to the normal output.
3285
3286 @item -e @var{string}
3287 @opindex -e
3288 Replace those output fields that are missing in the input with
3289 @var{string}.
3290
3291 @item -i
3292 @itemx --ignore-case
3293 @opindex -i
3294 @opindex --ignore-case
3295 Ignore differences in case when comparing keys.
With this option, the input files must likewise be sorted without regard to case.
3297 Use @samp{sort -f} to produce this ordering.
3298
3299 @item -1 @var{field}
3300 @itemx -j1 @var{field}
3301 @opindex -1
3302 @opindex -j1
3303 Join on field @var{field} (a positive integer) of file 1.
3304
3305 @item -2 @var{field}
3306 @itemx -j2 @var{field}
3307 @opindex -2
3308 @opindex -j2
3309 Join on field @var{field} (a positive integer) of file 2.
3310
3311 @item -j @var{field}
3312 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
3313
3314 @item -o @var{field-list}@dots{}
3315 Construct each output line according to the format in @var{field-list}.
3316 Each element in @var{field-list} is either the single character @samp{0} or
3317 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
3318 @samp{2} and @var{n} is a positive field number.
3319
3320 A field specification of @samp{0} denotes the join field.
3321 In most cases, the functionality of the @samp{0} field spec
3322 may be reproduced using the explicit @var{m.n} that corresponds
3323 to the join field.  However, when printing unpairable lines
3324 (using either of the @samp{-a} or @samp{-v} options), there is no way
3325 to specify the join field using @var{m.n} in @var{field-list}
3326 if there are unpairable lines in both files.
3327 To give @code{join} that functionality, @sc{posix} invented the @samp{0}
3328 field specification notation.
3329
3330 The elements in @var{field-list}
3331 are separated by commas or blanks.  Multiple @var{field-list}
3332 arguments can be given after a single @samp{-o} option; the values
3333 of all lists given with @samp{-o} are concatenated together.
All output lines, including those printed because of any @samp{-a} or
@samp{-v} option, are subject to the specified @var{field-list}.
3336
3337 @item -t @var{char}
3338 Use character @var{char} as the input and output field separator.
3339
3340 @item -v @var{file-number}
3341 Print a line for each unpairable line in file @var{file-number}
3342 (either @samp{1} or @samp{2}), instead of the normal output.
3343
3344 @end table
3345
3346 In addition, when @sc{gnu} @code{join} is invoked with exactly one argument,
3347 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
3348 options}.
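
The following sketch shows the default output and the effect of
combining @samp{-a}, @samp{-e}, and @samp{-o}.  The data is
hypothetical, and the example assumes that a @code{mktemp} utility
(not part of this package) is available to create the two temporary
files:

@example
# Two small files, already sorted on their first field.
f1=$(mktemp) || exit 1
f2=$(mktemp) || exit 1
printf 'a 1\nb 2\nc 3\n' > "$f1"
printf 'a x\nc y\nd z\n' > "$f2"

# Default: one output line per pair of lines with equal join fields.
join "$f1" "$f2"

# Also print the unpairable lines of both files.  Missing fields are
# replaced by NONE, and 0 in the field list denotes the join field.
join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$f1" "$f2"

rm -f "$f1" "$f2"
@end example

@noindent
The first @code{join} should print @samp{a 1 x} and @samp{c 3 y}.  The
second should print @samp{a 1 x}, @samp{b 2 NONE}, @samp{c 3 y}, and
@samp{d NONE z}, one per line.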
3349
3350
3351 @node Operating on characters
3352 @chapter Operating on characters
3353
3354 @cindex operating on characters
3355
These commands operate on individual characters.
3357
3358 @menu
3359 * tr invocation::               Translate, squeeze, and/or delete characters.
3360 * expand invocation::           Convert tabs to spaces.
3361 * unexpand invocation::         Convert spaces to tabs.
3362 @end menu
3363
3364
3365 @node tr invocation
3366 @section @code{tr}: Translate, squeeze, and/or delete characters
3367
3368 @pindex tr
3369
3370 Synopsis:
3371
3372 @example
3373 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
3374 @end example
3375
3376 @code{tr} copies standard input to standard output, performing
3377 one of the following operations:
3378
3379 @itemize @bullet
3380 @item
3381 translate, and optionally squeeze repeated characters in the result,
3382 @item
3383 squeeze repeated characters,
3384 @item
3385 delete characters,
3386 @item
3387 delete characters, then squeeze repeated characters from the result.
3388 @end itemize
3389
3390 The @var{set1} and (if given) @var{set2} arguments define ordered
3391 sets of characters, referred to below as @var{set1} and @var{set2}.  These
3392 sets are the characters of the input that @code{tr} operates on.
3393 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
3394 complement (all of the characters that are not in @var{set1}).
3395
3396 @menu
3397 * Character sets::              Specifying sets of characters.
* Translating::                 Changing one character to another.
3399 * Squeezing::                   Squeezing repeats and deleting.
3400 * Warnings in tr::              Warning messages.
3401 @end menu
3402
3403
3404 @node Character sets
3405 @subsection Specifying sets of characters
3406
3407 @cindex specifying sets of characters
3408
3409 The format of the @var{set1} and @var{set2} arguments resembles
3410 the format of regular expressions; however, they are not regular
3411 expressions, only lists of characters.  Most characters simply
3412 represent themselves in these strings, but the strings can contain
3413 the shorthands listed below, for convenience.  Some of them can be
3414 used only in @var{set1} or @var{set2}, as noted below.
3415
3416 @table @asis
3417
3418 @item Backslash escapes
3419 @cindex backslash escapes
3420
3421 A backslash followed by a character not listed below causes an error
3422 message.
3423
3424 @table @samp
3425 @item \a
3426 Control-G.
3427 @item \b
3428 Control-H.
3429 @item \f
3430 Control-L.
3431 @item \n
3432 Control-J.
3433 @item \r
3434 Control-M.
3435 @item \t
3436 Control-I.
3437 @item \v
3438 Control-K.
3439 @item \@var{ooo}
The character with the value given by @var{ooo}, which is 1 to 3
octal digits.
3442 @item \\
3443 A backslash.
3444 @end table
3445
3446 @item Ranges
3447 @cindex ranges
3448
3449 The notation @samp{@var{m}-@var{n}} expands to all of the characters
3450 from @var{m} through @var{n}, in ascending order.  @var{m} should
3451 collate before @var{n}; if it doesn't, an error results.  As an example,
3452 @samp{0-9} is the same as @samp{0123456789}.
3453
3454 @sc{gnu} @code{tr} does not support the System V syntax that uses square
3455 brackets to enclose ranges.  Translations specified in that format
3456 sometimes work as expected, since the brackets are often transliterated
3457 to themselves.  However, they should be avoided because they sometimes
3458 behave unexpectedly.  For example, @samp{tr -d '[0-9]'} deletes brackets
3459 as well as digits.
3460
3461 Many historically common and even accepted uses of ranges are not
3462 portable.  For example, on @sc{ebcdic} hosts using the @samp{A-Z}
3463 range will not do what most would expect because @samp{A} through @samp{Z}
3464 are not contiguous as they are in @sc{ascii}.
3465 If you can rely on a @sc{posix} compliant version of @code{tr}, then
3466 the best way to work around this is to use character classes (see below).
3467 Otherwise, it is most portable (and most ugly) to enumerate the members
3468 of the ranges.
3469
3470 @item Repeated characters
3471 @cindex repeated characters
3472
3473 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
3474 copies of character @var{c}.  Thus, @samp{[y*6]} is the same as
@samp{yyyyyy}.  The notation @samp{[@var{c}*]} in @var{set2} expands
3476 to as many copies of @var{c} as are needed to make @var{set2} as long as
3477 @var{set1}.  If @var{n} begins with @samp{0}, it is interpreted in
3478 octal, otherwise in decimal.
3479
3480 @item Character classes
@cindex character classes
3482
3483 The notation @samp{[:@var{class}:]} expands to all of the characters in
3484 the (predefined) class @var{class}.  The characters expand in no
3485 particular order, except for the @code{upper} and @code{lower} classes,
3486 which expand in ascending order.  When the @samp{--delete} (@samp{-d})
3487 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
3488 character class can be used in @var{set2}.  Otherwise, only the
3489 character classes @code{lower} and @code{upper} are accepted in
3490 @var{set2}, and then only if the corresponding character class
3491 (@code{upper} and @code{lower}, respectively) is specified in the same
3492 relative position in @var{set1}.  Doing this specifies case conversion.
3493 The class names are given below; an error results when an invalid class
3494 name is given.
3495
3496 @table @code
3497 @item alnum
3498 @opindex alnum
3499 Letters and digits.
3500 @item alpha
3501 @opindex alpha
3502 Letters.
3503 @item blank
3504 @opindex blank
3505 Horizontal whitespace.
3506 @item cntrl
3507 @opindex cntrl
3508 Control characters.
3509 @item digit
3510 @opindex digit
3511 Digits.
3512 @item graph
3513 @opindex graph
3514 Printable characters, not including space.
3515 @item lower
3516 @opindex lower
3517 Lowercase letters.
3518 @item print
3519 @opindex print
3520 Printable characters, including space.
3521 @item punct
3522 @opindex punct
3523 Punctuation characters.
3524 @item space
3525 @opindex space
3526 Horizontal or vertical whitespace.
3527 @item upper
3528 @opindex upper
3529 Uppercase letters.
3530 @item xdigit
3531 @opindex xdigit
3532 Hexadecimal digits.
3533 @end table
3534
3535 @item Equivalence classes
3536 @cindex equivalence classes
3537
3538 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
3539 equivalent to @var{c}, in no particular order.  Equivalence classes are
3540 a relatively recent invention intended to support non-English alphabets.
3541 But there seems to be no standard way to define them or determine their
3542 contents.  Therefore, they are not fully implemented in @sc{gnu} @code{tr};
3543 each character's equivalence class consists only of that character,
3544 which is of no particular use.
3545
3546 @end table
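
A few of the notations above can be tried as follows.  This is only a
sketch with made-up input, and the range example carries the
portability caveat noted under ``Ranges'':

@example
# Backslash escape: '\\' denotes a single backslash.
printf 'a\\b\n' | tr '\\' '/'

# Range: 0-9 stands for 0123456789.
echo 'a1b2c3' | tr -d '0-9'

# Repeat counts in set2: 'x[y*2]z' is the same as 'xyyz', and
# '[z*]' is padded to the length of set1.
echo 'abcd' | tr 'abcd' 'x[y*2]z'
echo 'abc' | tr 'abc' '[z*]'

# Character class: delete all punctuation characters.
echo 'Hello, World!' | tr -d '[:punct:]'
@end example

@noindent
These commands should print @samp{a/b}, @samp{abc}, @samp{xyyz},
@samp{zzz}, and @samp{Hello World}, respectively.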
3547
3548
3549 @node Translating
3550 @subsection Translating
3551
3552 @cindex translating characters
3553
3554 @code{tr} performs translation when @var{set1} and @var{set2} are
3555 both given and the @samp{--delete} (@samp{-d}) option is not given.
3556 @code{tr} translates each character of its input that is in @var{set1}
3557 to the corresponding character in @var{set2}.  Characters not in
3558 @var{set1} are passed through unchanged.  When a character appears more
3559 than once in @var{set1} and the corresponding characters in @var{set2}
3560 are not all the same, only the final one is used.  For example, these
3561 two commands are equivalent:
3562
3563 @example
3564 tr aaa xyz
3565 tr a z
3566 @end example
3567
3568 A common use of @code{tr} is to convert lowercase characters to
3569 uppercase.  This can be done in many ways.  Here are three of them:
3570
3571 @example
3572 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
3573 tr a-z A-Z
3574 tr '[:lower:]' '[:upper:]'
3575 @end example
3576
3577 @noindent
3578 But note that using ranges like @code{a-z} above is not portable.
3579
3580 When @code{tr} is performing translation, @var{set1} and @var{set2}
3581 typically have the same length.  If @var{set1} is shorter than
3582 @var{set2}, the extra characters at the end of @var{set2} are ignored.
3583
3584 On the other hand, making @var{set1} longer than @var{set2} is not
3585 portable; @sc{posix.2} says that the result is undefined.  In this situation,
3586 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
3587 the last character of @var{set2} as many times as necessary.  System V
3588 @code{tr} truncates @var{set1} to the length of @var{set2}.
3589
3590 By default, @sc{gnu} @code{tr} handles this case like BSD @code{tr}.  When
3591 the @samp{--truncate-set1} (@samp{-t}) option is given, @sc{gnu} @code{tr}
3592 handles this case like the System V @code{tr} instead.  This option is
3593 ignored for operations other than translation.
3594
3595 Acting like System V @code{tr} in this case breaks the relatively common
3596 BSD idiom:
3597
3598 @example
3599 tr -cs A-Za-z0-9 '\012'
3600 @end example
3601
3602 @noindent
3603 because it converts only zero bytes (the first element in the
3604 complement of @var{set1}), rather than all non-alphanumerics, to
3605 newlines.
3606
3607 @noindent
3608 By the way, the above idiom is not portable because it uses ranges.
3609 Assuming a @sc{posix} compliant @code{tr}, here is a better way to write it:
3610
3611 @example
3612 tr -cs '[:alnum:]' '[\n*]'
3613 @end example
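
The rules described in this section for repeated characters in
@var{set1} and for sets of unequal length can be checked with a short
sketch (the input words are arbitrary):

@example
# Only the final mapping for 'a' is used, so this acts like 'tr a z'.
echo banana | tr aaa xyz

# set1 longer than set2: by default the last character of set2 is
# repeated, as in BSD tr.
echo abcd | tr abcd xy

# With -t, set1 is truncated to the length of set2 instead, as in
# System V tr.
echo abcd | tr -t abcd xy
@end example

@noindent
With @sc{gnu} @code{tr} these should print @samp{bznznz}, @samp{xyyy},
and @samp{xycd}.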
3614
3615
3616 @node Squeezing
3617 @subsection Squeezing repeats and deleting
3618
3619 @cindex squeezing repeat characters
3620 @cindex deleting characters
3621
3622 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
3623 removes any input characters that are in @var{set1}.
3624
3625 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
3626 @code{tr} replaces each input sequence of a repeated character that
3627 is in @var{set1} with a single occurrence of that character.
3628
3629 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
3630 first performs any deletions using @var{set1}, then squeezes repeats
3631 from any remaining characters using @var{set2}.
3632
3633 The @samp{--squeeze-repeats} option may also be used when translating,
3634 in which case @code{tr} first performs translation, then squeezes
3635 repeats from any remaining characters using @var{set2}.
3636
3637 Here are some examples to illustrate various combinations of options:
3638
3639 @itemize @bullet
3640
3641 @item
3642 Remove all zero bytes:
3643
3644 @example
3645 tr -d '\000'
3646 @end example
3647
3648 @item
3649 Put all words on lines by themselves.  This converts all
3650 non-alphanumeric characters to newlines, then squeezes each string
3651 of repeated newlines into a single newline:
3652
3653 @example
3654 tr -cs '[:alnum:]' '[\n*]'
3655 @end example
3656
3657 @item
3658 Convert each sequence of repeated newlines to a single newline:
3659
3660 @example
3661 tr -s '\n'
3662 @end example
3663
3664 @item
3665 Find doubled occurrences of words in a document.
3666 For example, people often write ``the the'' with the duplicated words
separated by a newline.  The Bourne shell script below works first
3668 by converting each sequence of punctuation and blank characters to a
3669 single newline.  That puts each ``word'' on a line by itself.
3670 Next it maps all uppercase characters to lower case, and finally it
3671 runs @code{uniq} with the @samp{-d} option to print out only the words
3672 that were adjacent duplicates.
3673
3674 @example
3675 #!/bin/sh
3676 cat "$@@" \
3677   | tr -s '[:punct:][:blank:]' '\n' \
3678   | tr '[:upper:]' '[:lower:]' \
3679   | uniq -d
3680 @end example
3681
3682 @item
3683 Deleting a small set of characters is usually straightforward.  For example,
3684 to remove all @samp{a}s, @samp{x}s, and @samp{M}s you would do this:
3685
3686 @example
3687 tr -d axM
3688 @end example
3689
3690 However, when @samp{-} is one of those characters, it can be tricky because
3691 @samp{-} has special meanings.  Performing the same task as above but also
3692 removing all @samp{-} characters, we might try @code{tr -d -axM}, but
3693 that would fail because @code{tr} would try to interpret @samp{-a} as
3694 a command-line option.  Alternatively, we could try putting the hyphen
3695 inside the string, @code{tr -d a-xM}, but that wouldn't work either because
3696 it would make @code{tr} interpret @code{a-x} as the range of characters
@samp{a}@dots{}@samp{x} rather than the three characters @samp{a},
@samp{-}, and @samp{x}.
3698 One way to solve the problem is to put the hyphen at the end of the list
3699 of characters:
3700
3701 @example
3702 tr -d axM-
3703 @end example
3704
More generally, use the equivalence class notation @code{[=c=]}
with @samp{-} (or any other character) in place of the @samp{c}:
3707
3708 @example
3709 tr -d '[=-=]axM'
3710 @end example
3711
3712 Note how single quotes are used in the above example to protect the
3713 square brackets from interpretation by a shell.
3714
3715 @end itemize
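
The examples above do not show which set governs the squeezing step
when @samp{-s} is combined with deletion or translation.  The following
sketch, using arbitrary input, may help:

@example
# Delete every 'a' (set1), then squeeze repeated 'c' (set2).
echo 'aabbccdd' | tr -ds a c

# Translate 'a' to 'b', then squeeze repeated 'b' (set2).
echo 'aabb' | tr -s a b
@end example

@noindent
The first command should print @samp{bbcdd}; the repeated @samp{b}s and
@samp{d}s are untouched because they are not in @var{set2}.  The second
should print a single @samp{b}.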
3716
3717
3718 @node Warnings in tr
3719 @subsection Warning messages
3720
3721 @vindex POSIXLY_CORRECT
3722 Setting the environment variable @env{POSIXLY_CORRECT} turns off the
3723 following warning and error messages, for strict compliance with
3724 @sc{posix.2}.  Otherwise, the following diagnostics are issued:
3725
3726 @enumerate
3727
3728 @item
3729 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
3730 is not, and @var{set2} is given, @sc{gnu} @code{tr} by default prints
3731 a usage message and exits, because @var{set2} would not be used.
3732 The @sc{posix} specification says that @var{set2} must be ignored in
this case, but silently ignoring arguments is a bad idea.
3734
3735 @item
3736 When an ambiguous octal escape is given.  For example, @samp{\400}
3737 is actually @samp{\40} followed by the digit @samp{0}, because the
3738 value 400 octal does not fit into a single byte.
3739
3740 @end enumerate
3741
3742 @sc{gnu} @code{tr} does not provide complete BSD or System V compatibility.
3743 For example, it is impossible to disable interpretation of the @sc{posix}
3744 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}.  Also, @sc{gnu}
3745 @code{tr} does not delete zero bytes automatically, unlike traditional
3746 Unix versions, which provide no way to preserve zero bytes.
3747
3748
3749 @node expand invocation
3750 @section @code{expand}: Convert tabs to spaces
3751
3752 @pindex expand
3753 @cindex tabs to spaces, converting
3754 @cindex converting tabs to spaces
3755
3756 @code{expand} writes the contents of each given @var{file}, or standard
3757 input if none are given or for a @var{file} of @samp{-}, to standard
3758 output, with tab characters converted to the appropriate number of
3759 spaces.  Synopsis:
3760
3761 @example
3762 expand [@var{option}]@dots{} [@var{file}]@dots{}
3763 @end example
3764
3765 By default, @code{expand} converts all tabs to spaces.  It preserves
3766 backspace characters in the output; they decrement the column count for
3767 tab calculations.  The default action is equivalent to @samp{-8} (set
3768 tabs every 8 columns).
3769
3770 The program accepts the following options.  Also see @ref{Common options}.
3771
3772 @table @samp
3773
3774 @item -@var{tab1}[,@var{tab2}]@dots{}
3775 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3776 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3777 @opindex -@var{tab}
3778 @opindex -t
3779 @opindex --tabs
3780 @cindex tabstops, setting
3781 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3782 (default is 8).  Otherwise, set the tabs at columns @var{tab1},
3783 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
3784 last tabstop given with single spaces.  If the tabstops are specified
3785 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3786 blanks as well as by commas.
3787
3788 @item -i
3789 @itemx --initial
3790 @opindex -i
3791 @opindex --initial
3792 @cindex initial tabs, converting
Only convert initial tabs (those that precede all non-space, non-tab
characters) on each line to spaces.
3795
3796 @end table
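
Because spaces and tabs are hard to tell apart on a screen, the
following sketch pipes the result through @code{tr} so that each space
is shown as @samp{.} and each remaining tab as @samp{>}.  The input is
made up for the example:

@example
# Default tab stops every 8 columns: 'a', seven spaces, 'b'.
printf 'a\tb\n' | expand | tr ' ' '.'

# Tab stops every 4 columns.
printf 'a\tb\n' | expand -t 4 | tr ' ' '.'

# Explicit stops at columns 2 and 6; the tab beyond the last stop
# becomes a single space.
printf 'a\tb\tc\td\n' | expand -t 2,6 | tr ' ' '.'

# -i converts only the initial tab; the later one is kept.
printf '\tx\ty\n' | expand -i -t 4 | tr ' \t' '.>'
@end example

@noindent
The expected results are @samp{a.......b}, @samp{a...b},
@samp{a.b...c.d}, and @samp{....x>y}.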
3797
3798
3799 @node unexpand invocation
3800 @section @code{unexpand}: Convert spaces to tabs
3801
3802 @pindex unexpand
3803
3804 @code{unexpand} writes the contents of each given @var{file}, or
3805 standard input if none are given or for a @var{file} of @samp{-}, to
3806 standard output, with strings of two or more space or tab characters
3807 converted to as many tabs as possible followed by as many spaces as are
3808 needed.  Synopsis:
3809
3810 @example
3811 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
3812 @end example
3813
By default, @code{unexpand} converts only initial spaces and tabs (those
that precede all non-space, non-tab characters) on each line.  It
3816 preserves backspace characters in the output; they decrement the column
3817 count for tab calculations.  By default, tabs are set at every 8th
3818 column.
3819
3820 The program accepts the following options.  Also see @ref{Common options}.
3821
3822 @table @samp
3823
3824 @item -@var{tab1}[,@var{tab2}]@dots{}
3825 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3826 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3827 @opindex -@var{tab}
3828 @opindex -t
3829 @opindex --tabs
3830 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3831 instead of the default 8.  Otherwise, set the tabs at columns
3832 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
3833 tabs beyond the tabstops given unchanged.  If the tabstops are specified
3834 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3835 blanks as well as by commas.  This option implies the @samp{-a} option.
3836
3837 @item -a
3838 @itemx --all
3839 @opindex -a
3840 @opindex --all
3841 Convert all strings of two or more spaces or tabs, not just initial
3842 ones, to tabs.
3843
3844 @end table
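
The difference between the default behavior and @samp{-a} can be seen
in the following sketch.  The input is generated with @code{printf}
field widths so that the number of spaces is unambiguous, and
@code{tr} displays each resulting tab as @samp{>}:

@example
# Eight leading spaces reach the first tab stop and become one tab.
printf '%8sx\n' '' | unexpand | tr '\t' '>'

# Seven spaces after 'a' are not initial, so by default they are kept.
printf 'a%7sb\n' '' | unexpand | tr '\t' '>'

# With -a they are converted too, since they end at column 8.
printf 'a%7sb\n' '' | unexpand -a | tr '\t' '>'
@end example

@noindent
The first command should print @samp{>x}, the second should leave its
input unchanged, and the third should print @samp{a>b}.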
3845
3846 @c              What's GNU?
3847 @c              Arnold Robbins
3848 @node Opening the software toolbox
3849 @chapter Opening the software toolbox
3850
3851 This chapter originally appeared in @cite{Linux Journal}, volume 1,
3852 number 2, in the @cite{What's GNU?} column. It was written by Arnold
3853 Robbins.
3854
3855 @menu
3856 * Toolbox introduction::        Toolbox introduction
3857 * I/O redirection::             I/O redirection
3858 * The who command::             The @code{who} command
3859 * The cut command::             The @code{cut} command
3860 * The sort command::            The @code{sort} command
3861 * The uniq command::            The @code{uniq} command
3862 * Putting the tools together::  Putting the tools together
3863 @end menu
3864
3865
3866 @node Toolbox introduction
3867 @unnumberedsec Toolbox introduction
3868
3869 This month's column is only peripherally related to the @sc{gnu} Project, in
3870 that it describes a number of the @sc{gnu} tools on your Linux system and how
3871 they might be used.  What it's really about is the ``Software Tools'' philosophy
3872 of program development and usage.
3873
3874 The software tools philosophy was an important and integral concept
3875 in the initial design and development of Unix (of which Linux and @sc{gnu} are
3876 essentially clones).  Unfortunately, in the modern day press of
3877 Internetworking and flashy GUIs, it seems to have fallen by the
3878 wayside.  This is a shame, since it provides a powerful mental model
3879 for solving many kinds of problems.
3880
3881 Many people carry a Swiss Army knife around in their pants pockets (or
3882 purse).  A Swiss Army knife is a handy tool to have: it has several knife
3883 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
3884 a number of other things on it.  For the everyday, small miscellaneous jobs
3885 where you need a simple, general purpose tool, it's just the thing.
3886
3887 On the other hand, an experienced carpenter doesn't build a house using
3888 a Swiss Army knife.  Instead, he has a toolbox chock full of specialized
3889 tools---a saw, a hammer, a screwdriver, a plane, and so on.  And he knows
3890 exactly when and where to use each tool; you won't catch him hammering nails
3891 with the handle of his screwdriver.
3892
3893 The Unix developers at Bell Labs were all professional programmers and trained
3894 computer scientists.  They had found that while a one-size-fits-all program
3895 might appeal to a user because there's only one program to use, in practice
3896 such programs are
3897
3898 @enumerate a
3899 @item
3900 difficult to write,
3901
3902 @item
3903 difficult to maintain and
3904 debug, and
3905
3906 @item
3907 difficult to extend to meet new situations.
3908 @end enumerate
3909
3910 Instead, they felt that programs should be specialized tools.  In short, each
3911 program ``should do one thing well.''  No more and no less.  Such programs are
3912 simpler to design, write, and get right---they only do one thing.
3913
Furthermore, they found that with the right machinery for hooking programs
together, the whole was greater than the sum of the parts.  By combining
3916 several special purpose programs, you could accomplish a specific task
3917 that none of the programs was designed for, and accomplish it much more
3918 quickly and easily than if you had to write a special purpose program.
3919 We will see some (classic) examples of this further on in the column.
(An important additional point was that, if necessary, you should take a
detour and build any software tools you may need first, if you don't
already have something appropriate in the toolbox.)
3923
3924 @node I/O redirection
3925 @unnumberedsec I/O redirection
3926
3927 Hopefully, you are familiar with the basics of I/O redirection in the
3928 shell, in particular the concepts of ``standard input,'' ``standard output,''
3929 and ``standard error''.  Briefly, ``standard input'' is a data source, where
3930 data comes from.  A program should not need to either know or care if the
3931 data source is a disk file, a keyboard, a magnetic tape, or even a punched
3932 card reader.  Similarly, ``standard output'' is a data sink, where data goes
3933 to.  The program should neither know nor care where this might be.
3934 Programs that only read their standard input, do something to the data,
3935 and then send it on, are called ``filters'', by analogy to filters in a
3936 water pipeline.
3937
3938 With the Unix shell, it's very easy to set up data pipelines:
3939
3940 @smallexample
3941 program_to_create_data | filter1 | .... | filterN > final.pretty.data
3942 @end smallexample
3943
3944 We start out by creating the raw data; each filter applies some successive
3945 transformation to the data, until by the time it comes out of the pipeline,
3946 it is in the desired form.
3947
3948 This is fine and good for standard input and standard output.  Where does the
3949 standard error come in to play?  Well, think about @code{filter1} in
3950 the pipeline above.  What happens if it encounters an error in the data it
3951 sees?  If it writes an error message to standard output, it will just
3952 disappear down the pipeline into @code{filter2}'s input, and the
3953 user will probably never see it.  So programs need a place where they can send
3954 error messages so that the user will notice them.  This is standard error,
3955 and it is usually connected to your console or window, even if you have
3956 redirected standard output of your program away from your screen.
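
Here is a small sketch of that behavior.  The subshell is only a
stand-in for a hypothetical @code{filter1}; it writes one line of data
to standard output and one complaint to standard error:

@example
# Only "data" travels down the pipe to tr.  The complaint goes to
# standard error and so reaches the terminal unchanged.
( echo data; echo 'filter1: bad record' >&2 ) | tr '[:lower:]' '[:upper:]'
@end example

@noindent
You should see @samp{DATA} from the pipeline, while the error message
still appears in lower case, since it never passed through @code{tr}.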
3957
3958 For filter programs to work together, the format of the data has to be
3959 agreed upon.  The most straightforward and easiest format to use is simply
3960 lines of text.  Unix data files are generally just streams of bytes, with
3961 lines delimited by the @sc{ascii} @sc{lf} (Line Feed) character,
3962 conventionally called a ``newline'' in the Unix literature. (This is
3963 @code{'\n'} if you're a C programmer.)  This is the format used by all
3964 the traditional filtering programs.  (Many earlier operating systems
3965 had elaborate facilities and special purpose programs for managing
3966 binary data.  Unix has always shied away from such things, under the
3967 philosophy that it's easiest to simply be able to view and edit your
3968 data with a text editor.)
3969
3970 OK, enough introduction. Let's take a look at some of the tools, and then
3971 we'll see how to hook them together in interesting ways.   In the following
3972 discussion, we will only present those command line options that interest
3973 us.  As you should always do, double check your system documentation
3974 for the full story.
3975
3976 @node The who command
3977 @unnumberedsec The @code{who} command
3978
3979 The first program is the @code{who} command.  By itself, it generates a
3980 list of the users who are currently logged in.  Although I'm writing
3981 this on a single-user system, we'll pretend that several people are
3982 logged in:
3983
3984 @example
3985 $ who
3986 arnold   console Jan 22 19:57
3987 miriam   ttyp0   Jan 23 14:19(:0.0)
3988 bill     ttyp1   Jan 21 09:32(:0.0)
3989 arnold   ttyp2   Jan 23 20:48(:0.0)
3990 @end example
3991
3992 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
3993 There are three people logged in, and I am logged in twice.  On traditional
3994 Unix systems, user names are never more than eight characters long.  This
3995 little bit of trivia will be useful later.  The output of @code{who} is nice,
3996 but the data is not all that exciting.
3997
3998 @node The cut command
3999 @unnumberedsec The @code{cut} command
4000
4001 The next program we'll look at is the @code{cut} command.  This program
4002 cuts out columns or fields of input data.  For example, we can tell it
to print just the login name and full name from the @file{/etc/passwd}
file.  The @file{/etc/passwd} file has seven fields, separated by
4005 colons:
4006
4007 @example
4008 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
4009 @end example
4010
4011 To get the first and fifth fields, we would use cut like this:
4012
4013 @example
4014 $ cut -d: -f1,5 /etc/passwd
4015 root:Operator
4016 @dots{}
4017 arnold:Arnold D. Robbins
4018 miriam:Miriam A. Robbins
4019 @dots{}
4020 @end example
4021
4022 With the @samp{-c} option, @code{cut} will cut out specific characters
4023 (i.e., columns) in the input lines.  This command looks like it might be
4024 useful for data filtering.
4025
4026
4027 @node The sort command
4028 @unnumberedsec The @code{sort} command
4029
4030 Next we'll look at the @code{sort} command.  This is one of the most
4031 powerful commands on a Unix-style system; one that you will often find
4032 yourself using when setting up fancy data plumbing. The @code{sort}
4033 command reads and sorts each file named on the command line.  It then
4034 merges the sorted data and writes it to standard output.  It will read
4035 standard input if no files are given on the command line (thus
4036 making it into a filter).  The sort is based on the character collating
4037 sequence or based on user-supplied ordering criteria.
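
To illustrate user-supplied ordering criteria, here is a minimal sketch
that sorts colon-separated records numerically on their third field.  The
three records are invented for the example (they loosely follow the
@file{/etc/passwd} layout), and the @samp{-t} and @samp{-k} options are
the ones specified by @sc{posix}; an older @code{sort} may spell them
differently.

@example
# -t: sets the field separator; -k3,3n sorts numerically on field 3 only.
printf 'arnold:x:2076\nbill:x:27\nmiriam:x:112\n' | sort -t: -k3,3n
@end example

With this input the records should come out in the order @samp{bill},
@samp{miriam}, @samp{arnold}, whereas a plain @code{sort} would compare the
lines as text.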
4038
4039
4040 @node The uniq command
4041 @unnumberedsec The @code{uniq} command
4042
4043 Finally (at least for now), we'll look at the @code{uniq} program.  When
4044 sorting data, you will often end up with duplicate lines, lines that
4045 are identical.  Usually, all you need is one instance of each line.
4046 This is where @code{uniq} comes in. The @code{uniq} program reads its
4047 standard input, which it expects to be sorted.  It only prints out one
4048 copy of each duplicated line.  It does have several options.  Later on,
4049 we'll use the @samp{-c} option, which prints each unique line, preceded
4050 by a count of the number of times that line occurred in the input.
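
The reason @code{uniq} expects sorted input is that it compares only
adjacent lines.  The sketch below, using three made-up lines, shows the
difference.

@example
# Unsorted input: the two "b" lines are not adjacent, so both survive.
printf 'b\na\nb\n' | uniq

# Sorted input: duplicates are adjacent and collapse to a single line.
printf 'b\na\nb\n' | sort | uniq
@end example

The first command prints all three lines unchanged; the second prints
@samp{a} and @samp{b} once each.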
4051
4052
4053 @node Putting the tools together
4054 @unnumberedsec Putting the tools together
4055
4056 Now, let's suppose this is a large BBS system with dozens of users
4057 logged in.  The management wants the SysOp to write a program that will
4058 generate a sorted list of logged-in users.  Furthermore, even if a user
4059 is logged in multiple times, his or her name should only show up in the
4060 output once.
4061
4062 The SysOp could sit down with the system documentation and write a C
4063 program that did this. It would take perhaps a couple of hundred lines
4064 of code and about two hours to write it, test it, and debug it.
4065 However, knowing the software toolbox, the SysOp can instead start out
4066 by generating just a list of logged-on users:
4067
4068 @example
4069 $ who | cut -c1-8
4070 arnold
4071 miriam
4072 bill
4073 arnold
4074 @end example
4075
4076 Next, sort the list:
4077
4078 @example
4079 $ who | cut -c1-8 | sort
4080 arnold
4081 arnold
4082 bill
4083 miriam
4084 @end example
4085
4086 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
4087
4088 @example
4089 $ who | cut -c1-8 | sort | uniq
4090 arnold
4091 bill
4092 miriam
4093 @end example
4094
4095 The @code{sort} command actually has a @samp{-u} option that does what
4096 @code{uniq} does. However, @code{uniq} has other uses for which one
4097 cannot substitute @samp{sort -u}.
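
One such use, sketched here with the user names from the earlier listing
supplied by @code{printf}, is the @samp{-d} option, which prints only the
lines that occur more than once.  The result is a list of users who are
logged in several times, something @samp{sort -u} cannot report.

@example
# Print one copy of each line that appears more than once in sorted input.
printf 'arnold\nmiriam\nbill\narnold\n' | sort | uniq -d
@end example

Only @samp{arnold} should be printed.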
4098
4099 The SysOp puts this pipeline into a shell script, and makes it available for
4100 all the users on the system:
4101
4102 @example
4103 # cat > /usr/local/bin/listusers
4104 who | cut -c1-8 | sort | uniq
4105 ^D
4106 # chmod +x /usr/local/bin/listusers
4107 @end example
4108
4109 There are four major points to note here.  First, with just four
4110 programs, on one command line, the SysOp was able to save about two
4111 hours' worth of work.  Furthermore, the shell pipeline is just about as
4112 efficient as the C program would be, and it is much more efficient in
4113 terms of programmer time.  People time is much more expensive than
4114 computer time, and in our modern ``there's never enough time to do
4115 everything'' society, saving two hours of programmer time is no mean
4116 feat.
4117
4118 Second, it is also important to emphasize that with the
4119 @emph{combination} of the tools, it is possible to do a special-purpose
4120 job never imagined by the authors of the individual programs.
4121
4122 Third, it is also valuable to build up your pipeline in stages, as we did here.
4123 This allows you to view the data at each stage in the pipeline, which helps
4124 you acquire the confidence that you are indeed using these tools correctly.
4125
4126 Finally, bundling the pipeline in a shell script lets other users run
4127 your command without having to remember the fancy plumbing you set up for
4128 them.  In terms of how you run them, shell scripts and compiled programs are
4129 indistinguishable.
4130
4131 After the previous warm-up exercise, we'll look at two additional, more
4132 complicated pipelines.  For them, we need to introduce two more tools.
4133
4134 The first is the @code{tr} command, which stands for ``transliterate.''
4135 The @code{tr} command works on a character-by-character basis, changing
4136 characters. Normally it is used for things like mapping upper case to
4137 lower case:
4138
4139 @example
4140 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[:upper:]' '[:lower:]'
4141 this example has mixed case!
4142 @end example
4143
4144 There are several options of interest:
4145
4146 @table @samp
4147 @item -c
4148 work on the complement of the listed characters, i.e.,
4149 operations apply to characters not in the given set
4150
4151 @item -d
4152 delete characters in the first set from the output
4153
4154 @item -s
4155 squeeze repeated characters in the output into just one character.
4156 @end table
4157
4158 We will be using all three options in a moment.
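
Before combining them, here is a sketch of each option on its own.  The
input strings are invented for illustration.

@example
# -d: delete every punctuation character.
printf 'Hello, World!\n' | tr -d '[:punct:]'

# -s: squeeze each run of blanks down to a single blank.
printf 'too    many   blanks\n' | tr -s ' '

# -c with -d: delete everything that is NOT a digit or a newline.
printf 'abc123\n' | tr -cd '[:digit:]\012'
@end example

The three commands should print @samp{Hello World}, @samp{too many blanks},
and @samp{123}, respectively.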
4159
4160 The other command we'll look at is @code{comm}.  The @code{comm}
4161 command takes two sorted input files as input data, and prints out the
4162 files' lines in three columns.  The output columns are the data lines
4163 unique to the first file, the data lines unique to the second file, and
4164 the data lines that are common to both.  The @samp{-1}, @samp{-2}, and
4165 @samp{-3} command line options omit the respective columns. (This is
4166 non-intuitive and takes a little getting used to.)  For example:
4167
4168 @example
4169 $ cat f1
4170 11111
4171 22222
4172 33333
4173 44444
4174 $ cat f2
4175 00000
4176 22222
4177 33333
4178 55555
4179 $ comm f1 f2
4180         00000
4181 11111
4182                 22222
4183                 33333
4184 44444
4185         55555
4186 @end example
4187
4188 Giving a single dash (@samp{-}) as a file name tells @code{comm} to read
4189 standard input instead of a regular file; we will rely on this later.
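
The following sketch recreates @file{f1} and @file{f2} from the example
above in the current directory (an assumption made so that it can be run
as is) and shows how the column options combine, including the use of
@samp{-} for standard input.

@example
printf '11111\n22222\n33333\n44444\n' > f1
printf '00000\n22222\n33333\n55555\n' > f2

# Suppress columns 1 and 2: only lines common to both files remain.
comm -12 f1 f2

# Suppress columns 2 and 3: only lines unique to f1 remain.
comm -23 f1 f2

# The same as the previous command, with f1 arriving on standard input.
sort f1 | comm -23 - f2
@end example

The first command should print @samp{22222} and @samp{33333}; the other two
should each print @samp{11111} and @samp{44444}.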
4190
4191 Now we're ready to build a fancy pipeline.  The first application is a word
4192 frequency counter.  This helps an author determine if he or she is over-using
4193 certain words.
4194
4195 The first step is to change the case of all the letters in our input file
4196 to one case.  ``The'' and ``the'' should be counted as the same word.
4197
4198 @example
4199 $ tr '[:upper:]' '[:lower:]' < whats.gnu | ...
4200 @end example
4201
4202 The next step is to get rid of punctuation.  Quoted words and unquoted words
4203 should be treated identically; it's easiest to just get the punctuation out of
4204 the way.
4205
4206 @smallexample
4207 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' | ...
4208 @end smallexample
4209
4210 The second @code{tr} command operates on the complement of the listed
4211 characters, which are all the letters, the digits, the underscore, and
4212 the blank.  The @samp{\012} represents the newline character; it has to
4213 be left alone.  (The @sc{ascii} tab character should also be included for
4214 good measure in a production script.)
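
A variant that also preserves tabs, as the parenthetical remark suggests,
might look like the following sketch.  The sample sentence is made up, and
@samp{\011} is the octal code for the @sc{ascii} tab character.

@example
# Delete everything except letters, digits, underscore, blank, tab, newline.
printf '"Hello," she said.\n' | tr -cd '[:alnum:]_ \011\012'
@end example

This should print @samp{Hello she said}, with the quotation marks, comma,
and period removed.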
4215
4216 At this point, we have data consisting of words separated by blank space.
4217 The words only contain alphanumeric characters (and the underscore).  The
4218 next step is to break the data apart so that we have one word per line. This
4219 makes the counting operation much easier, as we will see shortly.
4220
4221 @smallexample
4222 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4223 > tr -s ' ' '\012' | ...
4224 @end smallexample
4225
4226 This command turns blanks into newlines.  The @samp{-s} option squeezes
4227 multiple newline characters in the output into just one.  This helps us
4228 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
4229 This is what the shell prints when it notices you haven't finished
4230 typing in all of a command.)
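
Here is the same step in isolation, as a sketch on a made-up line that
contains runs of several blanks.

@example
# Each run of blanks becomes a single newline, giving one word per line.
printf 'the  quick   brown fox\n' | tr -s ' ' '\012'
@end example

The output should be four lines, one word each, with no empty lines in
between.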
4231
4232 We now have data consisting of one word per line, no punctuation, all one
4233 case.  We're ready to count each word:
4234
4235 @smallexample
4236 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4237 > tr -s ' ' '\012' | sort | uniq -c | ...
4238 @end smallexample
4239
4240 At this point, the data might look something like this:
4241
4242 @example
4243   60 a
4244    2 able
4245    6 about
4246    1 above
4247    2 accomplish
4248    1 acquire
4249    1 actually
4250    2 additional
4251 @end example
4252
4253 The output is sorted by word, not by count!  What we want is the most
4254 frequently used words first.  Fortunately, this is easy to accomplish,
4255 with the help of two more @code{sort} options:
4256
4257 @table @samp
4258 @item -n
4259 do a numeric sort, not a textual one
4260
4261 @item -r
4262 reverse the order of the sort
4263 @end table
4264
4265 The final pipeline looks like this:
4266
4267 @smallexample
4268 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4269 > tr -s ' ' '\012' | sort | uniq -c | sort -nr
4270  156 the
4271   60 a
4272   58 to
4273   51 of
4274   51 and
4275  ...
4276 @end smallexample
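
To check the pipeline on something small enough to verify by eye, the
sketch below substitutes two invented sentences for @file{whats.gnu} and
keeps only the top entry.

@smallexample
printf 'The cat and the dog.\nThe end.\n' |
tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]_ \012' |
tr -s ' ' '\012' | sort | uniq -c | sort -nr | head -n 1
@end smallexample

The single line printed should report a count of 3 for @samp{the}; the
amount of leading white space depends on the @code{uniq} implementation.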
4277
4278 Whew!  That's a lot to digest.  Yet, the same principles apply. With six
4279 commands, on two lines (really one long one split for convenience), we've
4280 created a program that does something interesting and useful, in much
4281 less time than we could have written a C program to do the same thing.
4282
4283 A minor modification to the above pipeline can give us a simple spelling
4284 checker!  To determine if you've spelled a word correctly, all you have to
4285 do is look it up in a dictionary.  If it is not there, then chances are
4286 that your spelling is incorrect.  So, we need a dictionary.  If you
4287 have the Slackware Linux distribution, you have the file
4288 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
4289 dictionary.
4290
4291 Now, how to compare our file with the dictionary?  As before, we generate
4292 a sorted list of words, one per line:
4293
4294 @smallexample
4295 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4296 > tr -s ' ' '\012' | sort -u | ...
4297 @end smallexample
4298
4299 Now, all we need is a list of words that are @emph{not} in the
4300 dictionary.  Here is where the @code{comm} command comes in.
4301
4302 @smallexample
4303 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4304 > tr -s ' ' '\012' | sort -u |
4305 > comm -23 - /usr/lib/ispell/ispell.words
4306 @end smallexample
4307
4308 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
4309 dictionary (the second file), and lines that are in both files.  Lines
4310 only in the first file (standard input, our stream of words), are
4311 words that are not in the dictionary.  These are likely candidates for
4312 spelling errors.  This pipeline was the first cut at a production
4313 spelling checker on Unix.
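
The same idea can be tried without the @code{ispell} word list.  In this
sketch, @file{mini.dict} is a hypothetical four-word dictionary created on
the spot, and the input sentence contains one deliberate misspelling.
@samp{LC_ALL=C} is set so that @code{sort} and @code{comm} agree on the
collating order.

@smallexample
LC_ALL=C; export LC_ALL

# A tiny stand-in dictionary, already in sorted order.
printf 'a\ncat\nsat\nthe\n' > mini.dict

printf 'The cat szat.\n' |
tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]_ \012' |
tr -s ' ' '\012' | sort -u | comm -23 - mini.dict
@end smallexample

Only @samp{szat} should be printed, since it is the one word that does not
appear in the dictionary.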
4314
4315 There are some other tools that deserve brief mention.
4316
4317 @table @code
4318 @item grep
4319 search files for text that matches a regular expression
4320
4321 @item egrep
4322 like @code{grep}, but with more powerful regular expressions
4323
4324 @item wc
4325 count lines, words, characters
4326
4327 @item tee
4328 a T-fitting for data pipes, copies data to files and to standard output
4329
4330 @item sed
4331 the stream editor, an advanced tool
4332
4333 @item awk
4334 a data manipulation language, another advanced tool
4335 @end table
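
As a brief sketch of how some of these fit into a pipeline, the following
saves a copy of the data with @code{tee} while @code{grep} and @code{wc} go
on to count matching lines.  The user names are the ones from the earlier
listing, and @file{users.out} is a hypothetical file name.

@smallexample
# tee writes its input to users.out and also passes it down the pipe.
printf 'arnold\nmiriam\nbill\narnold\n' | tee users.out | grep arnold | wc -l

# The saved copy still holds all four lines.
wc -l < users.out
@end smallexample

The first command should report 2 and the second 4, possibly preceded by
white space.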
4336
4337 The software tools philosophy also espoused the following bit of
4338 advice: ``Let someone else do the hard part.'' This means, take
4339 something that gives you most of what you need, and then massage it the
4340 rest of the way until it's in the form that you want.
4341
4342 To summarize:
4343
4344 @enumerate 1
4345 @item
4346 Each program should do one thing well. No more, no less.
4347
4348 @item
4349 Combining programs with appropriate plumbing leads to results where
4350 the whole is greater than the sum of the parts.  It also leads to novel
4351 uses of programs that the authors might never have imagined.
4352
4353 @item
4354 Programs should never print extraneous header or trailer data, since these
4355 could get sent on down a pipeline. (A point we didn't mention earlier.)
4356
4357 @item
4358 Let someone else do the hard part.
4359
4360 @item
4361 Know your toolbox! Use each program appropriately. If you don't have an
4362 appropriate tool, build one.
4363 @end enumerate
4364
4365 As of this writing, all the programs we've discussed are available via
4366 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
4367 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was current
4368 when this column was written. Check the nearest @sc{gnu} archive for the
4369 current version.  The main @sc{gnu} FTP site is now @code{ftp.gnu.org}.}
4370
4371 None of what I have presented in this column is new. The Software Tools
4372 philosophy was first introduced in the book @cite{Software Tools},
4373 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
4374 0-201-03669-X).   This book showed how to write and use software
4375 tools.   It was written in 1976, using a preprocessor for FORTRAN named
4376 @code{ratfor} (RATional FORtran).  At the time, C was not as ubiquitous
4377 as it is now; FORTRAN was.  The last chapter presented a @code{ratfor}
4378 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
4379 awful lot like C; if you know C, you won't have any problem following
4380 the code.
4381
4382 In 1981, the book was updated and made available as @cite{Software
4383 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7).  Both books
4384 remain in print, and are well worth reading if you're a programmer.
4385 They certainly made a major change in how I view programming.
4386
4387 Initially, the programs in both books were available (on 9-track tape)
4388 from Addison-Wesley.  Unfortunately, this is no longer the case,
4389 although you might be able to find copies floating around the Internet.
4390 For a number of years, there was an active Software Tools Users Group,
4391 whose members had ported the original @code{ratfor} programs to essentially
4392 every computer system with a FORTRAN compiler.  The popularity of the
4393 group waned in the mid-1980s as Unix began to spread beyond universities.
4394
4395 With the current proliferation of @sc{gnu} code and other clones of Unix
4396 programs, these programs now receive little attention; modern C versions are
4397 much more efficient and do more than these programs do.  Nevertheless, as
4398 exposition of good programming style, and evangelism for a still-valuable
4399 philosophy, these books are unparalleled, and I recommend them highly.
4400
4401 Acknowledgment: I would like to express my gratitude to Brian Kernighan
4402 of Bell Labs, the original Software Toolsmith, for reviewing this column.
4403
4404
4405 @node Index
4406 @unnumbered Index
4407
4408 @printindex cp
4409
4410 @contents
4411 @bye
4412
4413 @c Local variables:
4414 @c texinfo-column-for-description: 32
4415 @c End: