doc/textutils.texi

   1 \input texinfo
   2 @c %**start of header
   3 @setfilename textutils.info
   4 @settitle GNU text utilities
   5 @c %**end of header
   6
   7 @include version.texi
   8
   9 @c Define new indices.
  10 @defcodeindex op
  11
  12 @c Put everything in one index (arbitrarily chosen to be the concept index).
  13 @syncodeindex fn cp
  14 @syncodeindex ky cp
  15 @syncodeindex op cp
  16 @syncodeindex pg cp
  17 @syncodeindex vr cp
  18
  19 @ifinfo
  20 @format
  21 START-INFO-DIR-ENTRY
  22 * Text utilities: (textutils).          GNU text utilities.
  23 * cat: (textutils)cat invocation.               Concatenate and write files.
  24 * cksum: (textutils)cksum invocation.           Print @sc{posix} CRC checksum.
  25 * comm: (textutils)comm invocation.             Compare sorted files by line.
  26 * csplit: (textutils)csplit invocation.         Split by context.
  27 * cut: (textutils)cut invocation.               Print selected parts of lines.
  28 * expand: (textutils)expand invocation.         Convert tabs to spaces.
  29 * fmt: (textutils)fmt invocation.               Reformat paragraph text.
  30 * fold: (textutils)fold invocation.             Wrap long input lines.
  31 * head: (textutils)head invocation.             Output the first part of files.
  32 * join: (textutils)join invocation.             Join lines on a common field.
  33 * md5sum: (textutils)md5sum invocation.         Print or check message-digests.
  34 * nl: (textutils)nl invocation.                 Number lines and write files.
  35 * od: (textutils)od invocation.                 Dump files in octal, etc.
  36 * paste: (textutils)paste invocation.           Merge lines of files.
  37 * pr: (textutils)pr invocation.                 Paginate or columnate files.
  38 * ptx: (textutils)ptx invocation.               Produce permuted indexes.
  39 * sort: (textutils)sort invocation.             Sort text files.
  40 * split: (textutils)split invocation.           Split into fixed-size pieces.
  41 * sum: (textutils)sum invocation.               Print traditional checksum.
  42 * tac: (textutils)tac invocation.               Reverse files.
  43 * tail: (textutils)tail invocation.             Output the last part of files.
  44 * tsort: (textutils)tsort invocation.           Topological sort.
  45 * tr: (textutils)tr invocation.                 Translate characters.
  46 * unexpand: (textutils)unexpand invocation.     Convert spaces to tabs.
  47 * uniq: (textutils)uniq invocation.             Uniquify files.
  48 * wc: (textutils)wc invocation.                 Byte, word, and line counts.
  49 END-INFO-DIR-ENTRY
  50 @end format
  51 @end ifinfo
  52
  53 @ifinfo
  54 This file documents the GNU text utilities.
  55
  56 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
  57
  58 Permission is granted to make and distribute verbatim copies of
  59 this manual provided the copyright notice and this permission notice
  60 are preserved on all copies.
  61
  62 @ignore
  63 Permission is granted to process this file through TeX and print the
  64 results, provided the printed document carries copying permission
  65 notice identical to this one except for the removal of this paragraph
  66 (this paragraph not being relevant to the printed manual).
  67
  68 @end ignore
  69 Permission is granted to copy and distribute modified versions of this
  70 manual under the conditions for verbatim copying, provided that the entire
  71 resulting derived work is distributed under the terms of a permission
  72 notice identical to this one.
  73
  74 Permission is granted to copy and distribute translations of this manual
  75 into another language, under the above conditions for modified versions,
  76 except that this permission notice may be stated in a translation approved
  77 by the Foundation.
  78 @end ifinfo
  79
  80 @titlepage
  81 @title GNU @code{textutils}
  82 @subtitle A set of text utilities
  83 @subtitle for version @value{VERSION}, @value{UPDATED}
  84 @author David MacKenzie et al.
  85
  86 @page
  87 @vskip 0pt plus 1filll
  88 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
  89
  90 Permission is granted to make and distribute verbatim copies of
  91 this manual provided the copyright notice and this permission notice
  92 are preserved on all copies.
  93
  94 Permission is granted to copy and distribute modified versions of this
  95 manual under the conditions for verbatim copying, provided that the entire
  96 resulting derived work is distributed under the terms of a permission
  97 notice identical to this one.
  98
  99 Permission is granted to copy and distribute translations of this manual
 100 into another language, under the above conditions for modified versions,
 101 except that this permission notice may be stated in a translation approved
 102 by the Foundation.
 103 @end titlepage
 104
 105
 106 @c If your makeinfo doesn't grok this @ifnottex directive, then either
 107 @c get a newer version of makeinfo or do s/ifnottex/ifinfo/ here and on
 108 @c the matching @end directive below.
 109 @ifnottex
 110 @node Top
 111 @top GNU text utilities
 112
 113 @cindex text utilities
 114 @cindex utilities for text handling
 115
 116 This manual documents version @value{VERSION} of the GNU text utilities.
 117
 118 @menu
 119 * Introduction::                       Caveats, overview, and authors.
 120 * Common options::                     Common options.
 121 * Output of entire files::             cat tac nl od
 122 * Formatting file contents::           fmt pr fold
 123 * Output of parts of files::           head tail split csplit
 124 * Summarizing files::                  wc sum cksum md5sum
 125 * Operating on sorted files::          sort uniq comm ptx tsort
 126 * Operating on fields within a line::  cut paste join
 127 * Operating on characters::            tr expand unexpand
 128 * Opening the software toolbox::       The software tools philosophy.
 129 * Index::                              General index.
 130
 131 @detailmenu
 132  --- The Detailed Node Listing ---
 133
 134 Output of entire files
 135
 136 * cat invocation::              Concatenate and write files.
 137 * tac invocation::              Concatenate and write files in reverse.
 138 * nl invocation::               Number lines and write files.
 139 * od invocation::               Write files in octal or other formats.
 140
 141 Formatting file contents
 142
 143 * fmt invocation::              Reformat paragraph text.
 144 * pr invocation::               Paginate or columnate files for printing.
 145 * fold invocation::             Wrap input lines to fit in specified width.
 146
 147 Output of parts of files
 148
 149 * head invocation::             Output the first part of files.
 150 * tail invocation::             Output the last part of files.
 151 * split invocation::            Split a file into fixed-size pieces.
 152 * csplit invocation::           Split a file into context-determined pieces.
 153
 154 Summarizing files
 155
 156 * wc invocation::               Print byte, word, and line counts.
 157 * sum invocation::              Print checksum and block counts.
 158 * cksum invocation::            Print CRC checksum and byte counts.
 159 * md5sum invocation::           Print or check message-digests.
 160
 161 Operating on sorted files
 162
 163 * sort invocation::             Sort text files.
 164 * uniq invocation::             Uniquify files.
 165 * comm invocation::             Compare two sorted files line by line.
 166 * ptx invocation::              Produce a permuted index of file contents.
 167 * tsort invocation::            Topological sort.
 168
 169 @code{ptx}: Produce permuted indexes
 170
 171 * General options in ptx::      Options which affect general program behavior.
 172 * Charset selection in ptx::    Underlying character set considerations.
 173 * Input processing in ptx::     Input fields, contexts, and keyword selection.
 174 * Output formatting in ptx::    Types of output format, and sizing the fields.
 175 * Compatibility in ptx::        The GNU extensions to @code{ptx}
 176
 177 Operating on fields within a line
 178
 179 * cut invocation::              Print selected parts of lines.
 180 * paste invocation::            Merge lines of files.
 181 * join invocation::             Join lines on a common field.
 182
 183 Operating on characters
 184
 185 * tr invocation::               Translate, squeeze, and/or delete characters.
 186 * expand invocation::           Convert tabs to spaces.
 187 * unexpand invocation::         Convert spaces to tabs.
 188
 189 @code{tr}: Translate, squeeze, and/or delete characters
 190
 191 * Character sets::              Specifying sets of characters.
  192 * Translating::                 Changing one set of characters to another.
 193 * Squeezing::                   Squeezing repeats and deleting.
 194 * Warnings in tr::              Warning messages.
 195
 196 Opening the software toolbox
 197
 198 * Toolbox introduction::        Toolbox introduction
 199 * I/O redirection::             I/O redirection
 200 * The who command::             The @code{who} command
 201 * The cut command::             The @code{cut} command
 202 * The sort command::            The @code{sort} command
 203 * The uniq command::            The @code{uniq} command
 204 * Putting the tools together::  Putting the tools together
 205
 206 @end detailmenu
 207 @end menu
 208
 209 @end ifnottex
 210
 211
 212 @node Introduction
 213 @chapter Introduction
 214
 215 @cindex introduction
 216
  217 This manual is incomplete: no attempt is made to explain basic concepts
  218 in a way suitable for novices.  If you are interested, please get
  219 involved in improving this manual.  The entire GNU community will
  220 benefit.
 221
 222 @cindex POSIX.2
 223 The GNU text utilities are mostly compatible with the @sc{posix.2} standard.
 224
 225 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
 226 @c sh-utils.texi too -- so be sure to keep them consistent.
 227 @cindex bugs, reporting
 228 Please report bugs to @email{bug-textutils@@gnu.org}.  Remember
 229 to include the version number, machine architecture, input files, and
 230 any other information needed to reproduce the bug: your input, what you
 231 expected, what you got, and why it is wrong.  Diffs are welcome, but
 232 please include a description of the problem as well, since this is
 233 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
 234
 235 This manual was originally derived from the Unix man pages in the
 236 distribution, which were written by David MacKenzie and updated by Jim
 237 Meyering.  What you are reading now is the authoritative documentation
  238 for these utilities; the man pages are no longer being maintained.
 239 The original @code{fmt} man page was written by Ross Paterson.
 240 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
 241 Karl Berry did the indexing, some reorganization, and editing of the results.
 242 Richard Stallman contributed his usual invaluable insights to the
 243 overall process.
 244
 245
 246 @node Common options
 247 @chapter Common options
 248
 249 @cindex common options
 250
  251 Certain options are available in all these programs.  Rather than
  252 repeating identical descriptions for each program, we describe
  253 them here.  (In fact, every GNU program accepts (or should accept)
  254 these options.)
 255
  256 A few of these programs take arbitrary strings as arguments.  In those
  257 cases, @samp{--help} and @samp{--version} are taken as these options
  258 only if there is exactly one command line argument.
 259
 260 @table @samp
 261
 262 @item --help
 263 @opindex --help
 264 @cindex help, online
 265 Print a usage message listing all available options, then exit successfully.
 266
 267 @item --version
 268 @opindex --version
 269 @cindex version number, finding
 270 Print the version number, then exit successfully.
 271
 272 @end table
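
As a brief illustration (a sketch, assuming the GNU versions of these
programs are installed and found on your search path), both options can
be tried with any of the utilities, for instance @code{cat}:

@example
cat --version
cat --help
@end example

@noindent
Each command should print its message on standard output and then exit
with a status of zero.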
 273
 274
 275 @node Output of entire files
 276 @chapter Output of entire files
 277
 278 @cindex output of entire files
 279 @cindex entire files, output of
 280
 281 These commands read and write entire files, possibly transforming them
 282 in some way.
 283
 284 @menu
 285 * cat invocation::              Concatenate and write files.
 286 * tac invocation::              Concatenate and write files in reverse.
 287 * nl invocation::               Number lines and write files.
 288 * od invocation::               Write files in octal or other formats.
 289 @end menu
 290
 291 @node cat invocation
 292 @section @code{cat}: Concatenate and write files
 293
 294 @pindex cat
 295 @cindex concatenate and write files
 296 @cindex copying files
 297
 298 @code{cat} copies each @var{file} (@samp{-} means standard input), or
 299 standard input if none are given, to standard output.  Synopsis:
 300
 301 @example
  302 cat [@var{option}]@dots{} [@var{file}]@dots{}
 303 @end example
 304
 305 The program accepts the following options.  Also see @ref{Common options}.
 306
 307 @table @samp
 308
 309 @item -A
 310 @itemx --show-all
 311 @opindex -A
 312 @opindex --show-all
 313 Equivalent to @samp{-vET}.
 314
 315 @item -B
 316 @itemx --binary
 317 @opindex -B
 318 @opindex --binary
 319 @cindex binary and text I/O in cat
 320 On MS-DOS and MS-Windows only, read and write the
 321 files in binary mode.  By default, @code{cat} on MS-DOS/MS-Windows uses
 322 binary mode only when standard output is redirected to a file or a pipe;
 323 this option overrides that.  Binary file I/O is used so that the files
 324 retain their format (Unix text as opposed to DOS text and binary),
 325 because @code{cat} is frequently used as a file-copying program.  Some
  326 options (see below) cause @code{cat} to read and write files in text mode
 327 because then the original file contents aren't important (e.g., when
 328 lines are numbered by @code{cat}, or when line endings should be
 329 marked).  This is so these options work as DOS/Windows users would
 330 expect; for example, DOS-style text files have their lines end with
 331 the CR-LF pair of characters which won't be processed as an empty line
 332 by @samp{-b} unless the file is read in text mode.
 333
 334 @item -b
 335 @itemx --number-nonblank
 336 @opindex -b
 337 @opindex --number-nonblank
 338 Number all nonblank output lines, starting with 1.  On MS-DOS and
 339 MS-Windows, this option causes @code{cat} to read and write files in
 340 text mode.
 341
 342 @item -e
 343 @opindex -e
 344 Equivalent to @samp{-vE}.
 345
 346 @item -E
 347 @itemx --show-ends
 348 @opindex -E
 349 @opindex --show-ends
 350 Display a @samp{$} after the end of each line.  On MS-DOS and
 351 MS-Windows, this option causes @code{cat} to read and write files in
 352 text mode.
 353
 354 @item -n
 355 @itemx --number
 356 @opindex -n
 357 @opindex --number
 358 Number all output lines, starting with 1.  On MS-DOS and MS-Windows,
 359 this option causes @code{cat} to read and write files in text mode.
 360
 361 @item -s
 362 @itemx --squeeze-blank
 363 @opindex -s
 364 @opindex --squeeze-blank
 365 @cindex squeezing blank lines
 366 Replace multiple adjacent blank lines with a single blank line.  On
 367 MS-DOS and MS-Windows, this option causes @code{cat} to read and write
 368 files in text mode.
 369
 370 @item -t
 371 @opindex -t
 372 Equivalent to @samp{-vT}.
 373
 374 @item -T
 375 @itemx --show-tabs
 376 @opindex -T
 377 @opindex --show-tabs
 378 Display TAB characters as @samp{^I}.
 379
 380 @item -u
 381 @opindex -u
 382 Ignored; for Unix compatibility.
 383
 384 @item -v
 385 @itemx --show-nonprinting
 386 @opindex -v
 387 @opindex --show-nonprinting
 388 Display control characters except for LFD and TAB using
 389 @samp{^} notation and precede characters that have the high bit set with
 390 @samp{M-}.  On MS-DOS and MS-Windows, this option causes @code{cat} to
 391 read files and standard input in DOS binary mode, so the CR
 392 characters at the end of each line are also visible.
 393
 394 @end table
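
As an informal illustration of how several of these options combine
(a sketch only; the @code{printf} utility used to generate the sample
input is an assumption and is not part of this package), the following
pipeline squeezes a run of blank lines and makes tabs and line ends
visible:

@example
printf 'a\tb\n\n\nc\n' | cat -s -A
@end example

@noindent
With GNU @code{cat} on a Unix-like system, this should print
@samp{a^Ib$}, then a lone @samp{$} for the one remaining blank line,
then @samp{c$}.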
 395
 396
 397 @node tac invocation
 398 @section @code{tac}: Concatenate and write files in reverse
 399
 400 @pindex tac
 401 @cindex reversing files
 402
 403 @code{tac} copies each @var{file} (@samp{-} means standard input), or
 404 standard input if none are given, to standard output, reversing the
 405 records (lines by default) in each separately.  Synopsis:
 406
 407 @example
 408 tac [@var{option}]@dots{} [@var{file}]@dots{}
 409 @end example
 410
 411 @dfn{Records} are separated by instances of a string (newline by
 412 default).  By default, this separator string is attached to the end of
 413 the record that it follows in the file.
 414
 415 The program accepts the following options.  Also see @ref{Common options}.
 416
 417 @table @samp
 418
 419 @item -b
 420 @itemx --before
 421 @opindex -b
 422 @opindex --before
 423 The separator is attached to the beginning of the record that it
 424 precedes in the file.
 425
 426 @item -r
 427 @itemx --regex
 428 @opindex -r
 429 @opindex --regex
 430 Treat the separator string as a regular expression.  Users of @code{tac}
 431 on MS-DOS/MS-Windows should note that, since @code{tac} reads files in
 432 binary mode, each line of a text file might end with a CR/LF pair
 433 instead of the Unix-style LF.
 434
 435 @item -s @var{separator}
 436 @itemx --separator=@var{separator}
 437 @opindex -s
 438 @opindex --separator
 439 Use @var{separator} as the record separator, instead of newline.
 440
 441 @end table
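
The following two commands sketch the default behavior and the effect
of @samp{-s} (the sample input is made up for illustration, and the
@code{printf} utility is assumed to be available):

@example
printf 'one\ntwo\nthree\n' | tac
printf 'a,b,c,' | tac -s ,
@end example

@noindent
The first command should print the three lines in the order
@samp{three}, @samp{two}, @samp{one}.  The second treats the comma as
the record separator, attached to the end of each record, and so
should print @samp{c,b,a,}.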
 442
 443
 444 @node nl invocation
 445 @section @code{nl}: Number lines and write files
 446
 447 @pindex nl
 448 @cindex numbering lines
 449 @cindex line numbering
 450
 451 @code{nl} writes each @var{file} (@samp{-} means standard input), or
 452 standard input if none are given, to standard output, with line numbers
 453 added to some or all of the lines.  Synopsis:
 454
 455 @example
 456 nl [@var{option}]@dots{} [@var{file}]@dots{}
 457 @end example
 458
 459 @cindex logical pages, numbering on
 460 @code{nl} decomposes its input into (logical) pages; by default, the
 461 line number is reset to 1 at the top of each logical page.  @code{nl}
 462 treats all of the input files as a single document; it does not reset
 463 line numbers or logical pages between files.
 464
 465 @cindex headers, numbering
 466 @cindex body, numbering
 467 @cindex footers, numbering
 468 A logical page consists of three sections: header, body, and footer.
 469 Any of the sections can be empty.  Each can be numbered in a different
 470 style from the others.
 471
 472 The beginnings of the sections of logical pages are indicated in the
 473 input file by a line containing exactly one of these delimiter strings:
 474
 475 @table @samp
 476 @item \:\:\:
 477 start of header;
 478 @item \:\:
 479 start of body;
 480 @item \:
 481 start of footer.
 482 @end table
 483
 484 The two characters from which these strings are made can be changed from
 485 @samp{\} and @samp{:} via options (see below), but the pattern and
 486 length of each string cannot be changed.
 487
 488 A section delimiter is replaced by an empty line on output.  Any text
 489 that comes before the first section delimiter string in the input file
 490 is considered to be part of a body section, so @code{nl} treats a
 491 file that contains no section delimiters as a single body section.
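
To make the structure of a logical page concrete, here is a small
sketch (the sample text is invented, and the @code{printf} utility is
assumed to be available) that feeds @code{nl} one page with a header,
a two-line body, and a footer:

@example
printf '\\:\\:\\:\nTitle\n\\:\\:\nfirst\nsecond\n\\:\nEnd\n' | nl
@end example

@noindent
With the default numbering styles, only the two body lines should be
numbered (1 and 2); the header and footer lines are written without
numbers, and each delimiter line appears as an empty line in the
output.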
 492
 493 The program accepts the following options.  Also see @ref{Common options}.
 494
 495 @table @samp
 496
 497 @item -b @var{style}
 498 @itemx --body-numbering=@var{style}
 499 @opindex -b
 500 @opindex --body-numbering
 501 Select the numbering style for lines in the body section of each
 502 logical page.  When a line is not numbered, the current line number
 503 is not incremented, but the line number separator character is still
 504 prepended to the line.  The styles are:
 505
 506 @table @samp
 507 @item a
 508 number all lines,
 509 @item t
 510 number only nonempty lines (default for body),
 511 @item n
 512 do not number lines (default for header and footer),
 513 @item p@var{regexp}
 514 number only lines that contain a match for @var{regexp}.
 515 @end table
 516
 517 @item -d @var{cd}
 518 @itemx --section-delimiter=@var{cd}
 519 @opindex -d
 520 @opindex --section-delimiter
 521 @cindex section delimiters of pages
 522 Set the section delimiter characters to @var{cd}; default is
 523 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
 524 (Remember to protect @samp{\} or other metacharacters from shell
 525 expansion with quotes or extra backslashes.)
 526
 527 @item -f @var{style}
 528 @itemx --footer-numbering=@var{style}
 529 @opindex -f
 530 @opindex --footer-numbering
 531 Analogous to @samp{--body-numbering}.
 532
 533 @item -h @var{style}
 534 @itemx --header-numbering=@var{style}
 535 @opindex -h
 536 @opindex --header-numbering
 537 Analogous to @samp{--body-numbering}.
 538
 539 @item -i @var{number}
 540 @itemx --page-increment=@var{number}
 541 @opindex -i
 542 @opindex --page-increment
 543 Increment line numbers by @var{number} (default 1).
 544
 545 @item -l @var{number}
 546 @itemx --join-blank-lines=@var{number}
 547 @opindex -l
 548 @opindex --join-blank-lines
 549 @cindex empty lines, numbering
 550 @cindex blank lines, numbering
 551 Consider @var{number} (default 1) consecutive empty lines to be one
 552 logical line for numbering, and only number the last one.  Where fewer
 553 than @var{number} consecutive empty lines occur, do not number them.
 554 An empty line is one that contains no characters, not even spaces
 555 or tabs.
 556
 557 @item -n @var{format}
 558 @itemx --number-format=@var{format}
 559 @opindex -n
 560 @opindex --number-format
 561 Select the line numbering format (default is @code{rn}):
 562
 563 @table @samp
 564 @item ln
 565 @opindex ln @r{format for @code{nl}}
 566 left justified, no leading zeros;
 567 @item rn
 568 @opindex rn @r{format for @code{nl}}
 569 right justified, no leading zeros;
 570 @item rz
 571 @opindex rz @r{format for @code{nl}}
 572 right justified, leading zeros.
 573 @end table
 574
 575 @item -p
 576 @itemx --no-renumber
 577 @opindex -p
 578 @opindex --no-renumber
 579 Do not reset the line number at the start of a logical page.
 580
 581 @item -s @var{string}
 582 @itemx --number-separator=@var{string}
 583 @opindex -s
 584 @opindex --number-separator
 585 Separate the line number from the text line in the output with
 586 @var{string} (default is the TAB character).
 587
 588 @item -v @var{number}
 589 @itemx --starting-line-number=@var{number}
 590 @opindex -v
 591 @opindex --starting-line-number
 592 Set the initial line number on each logical page to @var{number} (default 1).
 593
 594 @item -w @var{number}
 595 @itemx --number-width=@var{number}
 596 @opindex -w
 597 @opindex --number-width
 598 Use @var{number} characters for line numbers (default 6).
 599
 600 @end table
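
The following commands sketch how some of these options interact (the
input text is invented for illustration, and the @code{printf} utility
is assumed to be available):

@example
printf 'alpha\n\nbeta\n' | nl -b a -n rz -w 3 -s ' '
printf 'x\ny\n' | nl -v 10 -i 10 -w 2 -s ' '
@end example

@noindent
The first command numbers every line, including the empty one, using
three-character zero-padded numbers followed by a space, so the output
lines should begin with @samp{001}, @samp{002}, and @samp{003}.  The
second starts numbering at 10 and increments by 10, so it should print
@samp{10 x} and @samp{20 y}.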
 601
 602
 603 @node od invocation
 604 @section @code{od}: Write files in octal or other formats
 605
 606 @pindex od
 607 @cindex octal dump of files
 608 @cindex hex dump of files
 609 @cindex ASCII dump of files
 610 @cindex file contents, dumping unambiguously
 611
 612 @code{od} writes an unambiguous representation of each @var{file}
 613 (@samp{-} means standard input), or standard input if none are given.
 614 Synopsis:
 615
 616 @example
 617 od [@var{option}]@dots{} [@var{file}]@dots{}
 618 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
 619 @end example
 620
 621 Each line of output consists of the offset in the input, followed by
 622 groups of data from the file. By default, @code{od} prints the offset in
 623 octal, and each group of file data is two bytes of input printed as a
 624 single octal number.
 625
 626 The program accepts the following options.  Also see @ref{Common options}.
 627
 628 @table @samp
 629
 630 @item -A @var{radix}
 631 @itemx --address-radix=@var{radix}
 632 @opindex -A
 633 @opindex --address-radix
 634 @cindex radix for file offsets
 635 @cindex file offset radix
 636 Select the base in which file offsets are printed.  @var{radix} can
 637 be one of the following:
 638
 639 @table @samp
 640 @item d
 641 decimal;
 642 @item o
 643 octal;
 644 @item x
 645 hexadecimal;
 646 @item n
 647 none (do not print offsets).
 648 @end table
 649
 650 The default is octal.
 651
 652 @item -j @var{bytes}
 653 @itemx --skip-bytes=@var{bytes}
 654 @opindex -j
 655 @opindex --skip-bytes
 656 Skip @var{bytes} input bytes before formatting and writing.  If
 657 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
 658 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
 659 in decimal.  Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
 660 by 1024, and @samp{m} by 1048576.
 661
 662 @item -N @var{bytes}
 663 @itemx --read-bytes=@var{bytes}
 664 @opindex -N
 665 @opindex --read-bytes
 666 Output at most @var{bytes} bytes of the input.  Prefixes and suffixes on
  667 @var{bytes} are interpreted as for the @samp{-j} option.
 668
 669 @item -s [@var{n}]
 670 @itemx --strings[=@var{n}]
 671 @opindex -s
 672 @opindex --strings
 673 @cindex string constants, outputting
 674 Instead of the normal output, output only @dfn{string constants}: at
 675 least @var{n} (3 by default) consecutive @sc{ascii} graphic characters,
 676 followed by a null (zero) byte.
 677
 678 @item -t @var{type}
 679 @itemx --format=@var{type}
 680 @opindex -t
 681 @opindex --format
 682 Select the format in which to output the file data.  @var{type} is a
  683 string of one or more of the type indicator characters below.  If you
 684 include more than one type indicator character in a single @var{type}
 685 string, or use this option more than once, @code{od} writes one copy
 686 of each output line using each of the data types that you specified,
 687 in the order that you specified.
 688
 689 Adding a trailing ``z'' to any type specification appends a display
 690 of the @sc{ascii} character representation of the printable characters
 691 to the output line generated by the type specification.
 692
 693 @table @samp
 694 @item a
 695 named character,
 696 @item c
 697 @sc{ascii} character or backslash escape,
 698 @item d
 699 signed decimal,
 700 @item f
 701 floating point,
 702 @item o
 703 octal,
 704 @item u
 705 unsigned decimal,
 706 @item x
 707 hexadecimal.
 708 @end table
 709
 710 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
 711 newline, and @samp{nul} for a null (zero) byte.  Type @code{c} outputs
  712 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
 713
 714 @cindex type size
 715 Except for types @samp{a} and @samp{c}, you can specify the number
 716 of bytes to use in interpreting each number in the given data type
 717 by following the type indicator character with a decimal integer.
  718 Alternatively, you can specify the size of one of the C compiler's
 719 built-in data types by following the type indicator character with
 720 one of the following characters.  For integers (@samp{d}, @samp{o},
 721 @samp{u}, @samp{x}):
 722
 723 @table @samp
 724 @item C
 725 char,
 726 @item S
 727 short,
 728 @item I
 729 int,
 730 @item L
 731 long.
 732 @end table
 733
 734 For floating point (@code{f}):
 735
 736 @table @asis
 737 @item F
 738 float,
 739 @item D
 740 double,
 741 @item L
 742 long double.
 743 @end table
 744
 745 @item -v
 746 @itemx --output-duplicates
 747 @opindex -v
 748 @opindex --output-duplicates
 749 Output consecutive lines that are identical.  By default, when two or
 750 more consecutive output lines would be identical, @code{od} outputs only
 751 the first line, and puts just an asterisk on the following line to
 752 indicate the elision.
 753
 754 @item -w[@var{n}]
 755 @itemx --width[=@var{n}]
 756 @opindex -w
 757 @opindex --width
  758 Dump @var{n} input bytes per output line.  This must be a multiple of
 759 the least common multiple of the sizes associated with the specified
 760 output types.  If @var{n} is omitted, the default is 32.  If this option
 761 is not given at all, the default is 16.
 762
 763 @end table
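
As a rough illustration of the options above (a sketch with invented
input, assuming the @code{printf} utility is available), the following
commands dump a few bytes in hexadecimal:

@example
printf 'ABC\n' | od -A n -t x1
printf 'ABCD' | od -A n -t x1 -j 1 -N 2
printf 'ABC\n' | od -A x -t x1 -t c
@end example

@noindent
The first command suppresses the offsets and should print the four
bytes @samp{41 42 43 0a}.  The second skips one byte and reads at most
two, so it should print @samp{42 43}.  The third prints hexadecimal
offsets and, because two types are given, writes each group of input
twice: once as hexadecimal bytes and once as characters.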
 764
 765 The next several options map the old, pre-@sc{posix} format specification
 766 options to the corresponding @sc{posix} format specs.  GNU @code{od} accepts
 767 any combination of old- and new-style options.  Format specification
 768 options accumulate.
 769
 770 @table @samp
 771
 772 @item -a
 773 @opindex -a
 774 Output as named characters.  Equivalent to @samp{-ta}.
 775
 776 @item -b
 777 @opindex -b
 778 Output as octal bytes.  Equivalent to @samp{-toC}.
 779
 780 @item -c
 781 @opindex -c
 782 Output as @sc{ascii} characters or backslash escapes.  Equivalent to
 783 @samp{-tc}.
 784
 785 @item -d
 786 @opindex -d
 787 Output as unsigned decimal shorts.  Equivalent to @samp{-tu2}.
 788
 789 @item -f
 790 @opindex -f
 791 Output as floats.  Equivalent to @samp{-tfF}.
 792
 793 @item -h
 794 @opindex -h
 795 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 796
 797 @item -i
 798 @opindex -i
 799 Output as decimal shorts.  Equivalent to @samp{-td2}.
 800
 801 @item -l
 802 @opindex -l
 803 Output as decimal longs.  Equivalent to @samp{-td4}.
 804
 805 @item -o
 806 @opindex -o
 807 Output as octal shorts.  Equivalent to @samp{-to2}.
 808
 809 @item -x
 810 @opindex -x
 811 Output as hexadecimal shorts.  Equivalent to @samp{-tx2}.
 812
 813 @item -C
 814 @itemx --traditional
 815 @opindex --traditional
  816 Recognize the pre-@sc{posix} non-option arguments that traditional @code{od}
 817 accepted.  The following syntax:
 818
 819 @example
 820 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
 821 @end example
 822
 823 @noindent
 824 can be used to specify at most one file and optional arguments
 825 specifying an offset and a pseudo-start address, @var{label}.  By
 826 default, @var{offset} is interpreted as an octal number specifying how
 827 many input bytes to skip before formatting and writing.  The optional
 828 trailing decimal point forces the interpretation of @var{offset} as a
  829 decimal number.  If no decimal point is given and the offset begins with
  830 @samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number.  If
 831 there is a trailing @samp{b}, the number of bytes skipped will be
 832 @var{offset} multiplied by 512.  The @var{label} argument is interpreted
 833 just like @var{offset}, but it specifies an initial pseudo-address.  The
 834 pseudo-addresses are displayed in parentheses following any normal
 835 address.
 836
 837 @end table
 838
 839
 840 @node Formatting file contents
 841 @chapter Formatting file contents
 842
 843 @cindex formatting file contents
 844
 845 These commands reformat the contents of files.
 846
 847 @menu
 848 * fmt invocation::              Reformat paragraph text.
 849 * pr invocation::               Paginate or columnate files for printing.
 850 * fold invocation::             Wrap input lines to fit in specified width.
 851 @end menu
 852
 853
 854 @node fmt invocation
 855 @section @code{fmt}: Reformat paragraph text
 856
 857 @pindex fmt
 858 @cindex reformatting paragraph text
 859 @cindex paragraphs, reformatting
 860 @cindex text, reformatting
 861
 862 @code{fmt} fills and joins lines to produce output lines of (at most)
 863 a given number of characters (75 by default).  Synopsis:
 864
 865 @example
 866 fmt [@var{option}]@dots{} [@var{file}]@dots{}
 867 @end example
 868
 869 @code{fmt} reads from the specified @var{file} arguments (or standard
 870 input if none are given), and writes to standard output.
 871
 872 By default, blank lines, spaces between words, and indentation are
 873 preserved in the output; successive input lines with different
 874 indentation are not joined; tabs are expanded on input and introduced on
 875 output.
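
A small sketch of this default behavior (the sample text is invented,
and the @code{printf} utility is assumed to be available):

@example
printf 'The quick\nbrown fox.\n\nNew paragraph.\n' | fmt
@end example

@noindent
The two short lines of the first paragraph should be joined into the
single line @samp{The quick brown fox.}, while the blank line and the
second paragraph pass through unchanged.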
 876
 877 @cindex line-breaking
 878 @cindex sentences and line-breaking
 879 @cindex Knuth, Donald E.
 880 @cindex Plass, Michael F.
 881 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
 882 avoid line breaks after the first word of a sentence or before the last
 883 word of a sentence.  A @dfn{sentence break} is defined as either the end
 884 of a paragraph or a word ending in any of @samp{.?!}, followed by two
 885 spaces or end of line, ignoring any intervening parentheses or quotes.
 886 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
 887 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
 888 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
 889 and Experience}, 11 (1981), 1119--1184).
 890
 891 The program accepts the following options.  Also see @ref{Common options}.
 892
 893 @table @samp
 894
 895 @item -c
 896 @itemx --crown-margin
 897 @opindex -c
 898 @opindex --crown-margin
 899 @cindex crown margin
 900 @dfn{Crown margin} mode: preserve the indentation of the first two
 901 lines within a paragraph, and align the left margin of each subsequent
 902 line with that of the second line.
 903
 904 @item -t
 905 @itemx --tagged-paragraph
 906 @opindex -t
 907 @opindex --tagged-paragraph
 908 @cindex tagged paragraphs
 909 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
 910 indentation of the first line of a paragraph is the same as the
 911 indentation of the second, the first line is treated as a one-line
 912 paragraph.
 913
 914 @item -s
 915 @itemx --split-only
 916 @opindex -s
 917 @opindex --split-only
Split lines only.  Do not join short lines to form longer ones.  This
prevents sample lines of code and other such ``formatted'' text from
being unduly combined.
 921
 922 @item -u
 923 @itemx --uniform-spacing
 924 @opindex -u
 925 @opindex --uniform-spacing
 926 Uniform spacing.  Reduce spacing between words to one space, and spacing
 927 between sentences to two spaces.
 928
 929 @item -@var{width}
 930 @itemx -w @var{width}
 931 @itemx --width=@var{width}
 932 @opindex -@var{width}
 933 @opindex -w
 934 @opindex --width
 935 Fill output lines up to @var{width} characters (default 75).  @code{fmt}
 936 initially tries to make lines about 7% shorter than this, to give it
 937 room to balance line lengths.
 938
@item -p @var{prefix}
@itemx --prefix=@var{prefix}
@opindex -p
@opindex --prefix
Only lines beginning with @var{prefix} (possibly preceded by whitespace)
are subject to formatting.  The prefix and any preceding whitespace are
stripped for the formatting and then re-attached to each formatted output
line.  One use is to format certain kinds of program comments, while
leaving the code unchanged.
 946
 947 @end table
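
The options above might be exercised as follows.  This is only a
sketch: the sample input and the width of 30 are arbitrary choices, and
the exact line breaks chosen by @code{fmt} may vary.

@example
# Refill to at most 30 columns.
printf 'one two three four five six seven eight nine ten\n' | fmt -w 30
# Split long lines only; these short lines are not joined.
printf 'short\nlines\n' | fmt -s
# Refill only the lines that start with the prefix.
printf '# a comment\n# continued\ncode here\n' | fmt -p '#'
@end example

In the last command, the line without the prefix should pass through
unchanged.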
 948
 949
 950 @node pr invocation
 951 @section @code{pr}: Paginate or columnate files for printing
 952
 953 @pindex pr
 954 @cindex printing, preparing files for
 955 @cindex multicolumn output, generating
 956 @cindex merging files in parallel
 957
 958 @code{pr} writes each @var{file} (@samp{-} means standard input), or
 959 standard input if none are given, to standard output, paginating and
 960 optionally outputting in multicolumn format; optionally merges all
 961 @var{file}s, printing all in parallel, one per column.  Synopsis:
 962
 963 @example
 964 pr [@var{option}]@dots{} [@var{file}]@dots{}
 965 @end example
 966
By default, a 5-line header is printed on each page: two blank lines;
a line with the date, the filename, and the page number; and two more
blank lines.  A footer of five blank lines is also printed.  With the @samp{-F}
option, a 3-line header is printed: the leading two blank lines are
omitted; no footer is used.  The default @var{page_length} in both cases is 66
lines.  The default number of text lines changes from 56 (without @samp{-F})
to 63 (with @samp{-F}).  The text line of the header takes up the full
@var{page_width} in the form @samp{yyyy-mm-dd HH:MM @var{string} Page nnnn},
where @var{string} is a centered header string.
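
The line counts quoted above follow from simple subtraction, which can
be checked in the shell.  The last command is an assumption-laden
sketch: it relies on @code{pr} padding a short input out to one full
default page.

@example
# 66 page lines - 5 header lines - 5 footer lines
echo $((66 - 5 - 5))
# with -F: 66 page lines - 3 header lines, no footer
echo $((66 - 3))
# count the lines of one default page
printf 'text\n' | pr | wc -l
@end example

The first two commands print 56 and 63; the last is expected to report
the full page length of 66.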
 976
 977 Form feeds in the input cause page breaks in the output.  Multiple form
 978 feeds produce empty pages.
 979
Columns are of equal width, separated by an optional string (default
is a space).  For multicolumn output, lines are always truncated to
@var{page_width} (default 72), unless you use the @samp{-J} option.  For
single-column output, no line truncation occurs by default.  Use the
@samp{-W} option to truncate lines in that case.
 985
As of version 1.22i:

Some lowercase-letter options (@samp{-s}, @samp{-w}) have been redefined
for better @sc{posix} compliance.  The output in some
further cases has been adapted to other Unix implementations.  These
changes break backward compatibility.

Some new capital-letter options (@samp{-J}, @samp{-S}, @samp{-W})
have been introduced to turn off unexpected interference among
lowercase-letter options.  The @samp{-N} option and the second argument
@var{last_page} of @samp{+@var{first_page}} offer more flexibility.  The
detailed handling of form feeds set in the input files requires the
@samp{-T} option.

Capital-letter options override lowercase-letter ones.

Some of the option-arguments (compare @samp{-s}, @samp{-S}, @samp{-e},
@samp{-i}, @samp{-n}) cannot be specified as separate arguments from the
preceding option letter (as already stated in the @sc{posix} specification).
1004
1005 The program accepts the following options.  Also see @ref{Common options}.
1006
1007 @table @samp
1008
1009 @item +@var{first_page}[:@var{last_page}]
1010 @itemx --pages=@var{first_page}[:@var{last_page}]
1011 @opindex +@var{first_page}[:@var{last_page}]
1012 @opindex --pages
Begin printing with page @var{first_page} and stop with @var{last_page}.
A missing @samp{:@var{last_page}} implies end of file.  When counting
the pages to skip, each form feed in the input file starts
a new page.  Page counting with and without @samp{+@var{first_page}}
is identical.  By default, counting starts with the first page of the input
file (not the first page printed).  Line numbering may be altered by the
@samp{-N} option.
1020
1021 @item -@var{column}
1022 @itemx --columns=@var{column}
1023 @opindex -@var{column}
1024 @opindex --columns
1025 @cindex down columns
With each single @var{file}, produce @var{column} columns of output
(default is 1) and print columns down, unless @samp{-a} is used.  The
column width is automatically decreased as @var{column} increases, unless
you use the @samp{-W/-w} option to increase @var{page_width} as well.
This option might well cause some lines to be truncated.  The number of
lines in the columns on each page is balanced.  The options @samp{-e}
and @samp{-i} are on for multiple text-column output.  With the
@samp{-J} option, column alignment and line truncation are turned off:
lines of full length are joined in a free field format, and the @samp{-S}
option may set field separators.  @samp{-@var{column}} may not be used
with the @samp{-m} option.
1037
1038 @item -a
1039 @itemx --across
1040 @opindex -a
1041 @opindex --across
1042 @cindex across columns
1043 With each single @var{file}, print columns across rather than down.  The
1044 @samp{-@var{column}} option must be given with @var{column} greater than one.
1045 If a line is too long to fit in a column, it is truncated.
1046
1047 @item -c
1048 @itemx --show-control-chars
1049 @opindex -c
1050 @opindex --show-control-chars
1051 Print control characters using hat notation (e.g., @samp{^G}); print
1052 other unprintable characters in octal backslash notation.  By default,
1053 unprintable characters are not changed.
1054
1055 @item -d
1056 @itemx --double-space
1057 @opindex -d
1058 @opindex --double-space
1059 @cindex double spacing
1060 Double space the output.
1061
1062 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
1063 @itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
1064 @opindex -e
1065 @opindex --expand-tabs
1066 @cindex input tabs
Expand tabs to spaces on input.  The optional argument @var{in-tabchar} is
the input tab character (default is the TAB character).  The second optional
argument @var{in-tabwidth} is the input tab character's width (default
is 8).
1071
1072 @item -f
1073 @itemx -F
1074 @itemx --form-feed
1075 @opindex -F
1076 @opindex -f
1077 @opindex --form-feed
Use a form feed instead of newlines to separate output pages.  The default
page length of 66 lines is not altered, but the number of lines of text
per page changes from the default 56 to 63 lines.
1081
@item -h @var{header}
@itemx --header=@var{header}
@opindex -h
@opindex --header
Replace the filename in the header with the centered string @var{header}.
Left-hand-side truncation (marked by a @samp{*}) may occur if the total
header line @samp{yyyy-mm-dd HH:MM @var{header} Page nnnn} becomes longer than
@var{page_width}.  @samp{-h ""} prints a blank line header.  Don't use
@samp{-h""}; a space between the @samp{-h} option and its argument is
always required.
1093
1094 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
1095 @itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
1096 @opindex -i
1097 @opindex --output-tabs
1098 @cindex output tabs
Replace spaces with tabs on output.  The optional argument @var{out-tabchar}
is the output tab character (default is the TAB character).  The second optional
argument @var{out-tabwidth} is the output tab character's width (default
is 8).
1103
1104 @item -J
1105 @itemx --join-lines
1106 @opindex -J
1107 @opindex --join-lines
Merge lines of full length.  This option is used together with the column
options @samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}.  It
turns off @samp{-W/-w} line truncation and column alignment, and may be
used with @samp{-S[@var{string}]}.
@samp{-J} has been introduced (together with @samp{-W} and @samp{-S})
to disentangle the old (@sc{posix}-compliant) options @samp{-w} and
@samp{-s} along with the three column options.
1115
1116
1117 @item -l @var{page_length}
1118 @itemx --length=@var{page_length}
1119 @opindex -l
1120 @opindex --length
Set the page length to @var{page_length} (default 66) lines, including
the lines of the header [and the footer].  If @var{page_length} is less
than or equal to 10 (or to 3 with @samp{-F}), the header and footer are
omitted, and all form feeds set in input files are eliminated, as if
the @samp{-T} option had been given.
1126
1127 @item -m
1128 @itemx --merge
1129 @opindex -m
1130 @opindex --merge
Merge and print all @var{file}s in parallel, one in each column.  If a
line is too long to fit in a column, it is truncated, unless the @samp{-J}
option is used.  @samp{-S[@var{string}]} may be used.  Empty pages in
some @var{file}s (form feeds set) produce empty columns, still marked
by @var{string}.  The result is continuous line numbering and column
marking throughout the whole merged file.  Completely empty merged pages
show no separators or line numbers.  The default header becomes
@samp{yyyy-mm-dd HH:MM <blanks> Page nnnn}; use
@samp{-h @var{header}} to fill in the blank middle part.
1140
1141 @item -n[@var{number-separator}[@var{digits}]]
1142 @itemx --number-lines[=@var{number-separator}[@var{digits}]]
1143 @opindex -n
1144 @opindex --number-lines
Provide @var{digits}-digit line numbering (default for @var{digits} is
5).  With multicolumn output, the number occupies the first @var{digits}
column positions of each text column, or of each line of @samp{-m}
output.  With single-column output, the number precedes each line just as
@samp{-m} does.  By default, line numbering starts with the first
line of the input file (not the first line printed; compare the
@samp{--pages} option and the @samp{-N} option).
The optional argument @var{number-separator} is the character appended to
the line number to separate it from the text that follows.  The default
separator is the TAB character.  Strictly speaking, a TAB is
printed only with single-column output.  There, the TAB width varies
with the TAB position, e.g., with the left @var{margin} specified
by the @samp{-o} option.  With multicolumn output, priority is given to
``equal width of output columns'' (a @sc{posix} specification).
The TAB width is fixed to the value of the first column and does
not change with different values of the left @var{margin}.  That means a
fixed number of spaces is always printed in place of the
@var{number-separator} TAB.  The tabification depends upon the output
position.
1164
1165 @item -N @var{line_number}
1166 @itemx --first-line-number=@var{line_number}
1167 @opindex -N
1168 @opindex --first-line-number
Start line counting with the number @var{line_number} at the first line of
the first page printed (in most cases not the first line of the input file).
1171
1172 @item -o @var{margin}
1173 @itemx --indent=@var{margin}
1174 @opindex -o
1175 @opindex --indent
1176 @cindex indenting lines
1177 @cindex left margin
1178 Indent each line with a margin @var{margin} spaces wide (default is zero).
1179 The total page width is the size of the margin plus the @var{page_width}
1180 set with the @samp{-W/-w} option.  A limited overflow may occur with
1181 numbered single column output (compare @samp{-n} option).
1182
1183 @item -r
1184 @itemx --no-file-warnings
1185 @opindex -r
1186 @opindex --no-file-warnings
1187 Do not print a warning message when an argument @var{file} cannot be
1188 opened.  (The exit status will still be nonzero, however.)
1189
1190 @item -s[@var{char}]
1191 @itemx --separator[=@var{char}]
1192 @opindex -s
1193 @opindex --separator
Separate columns by a single character @var{char}.  The default for @var{char}
is the TAB character without @samp{-w} and no character with
@samp{-w}.  Without @samp{-s}, the default separator is a space.
@samp{-s[@var{char}]} turns off line truncation for all three column options
(@samp{-@var{column}}, @samp{-a -@var{column}}, @samp{-m}) unless @samp{-w}
is set.  This is a @sc{posix}-compliant formulation.
1200
1201
1202 @item -S[@var{string}]
1203 @itemx --sep-string[=@var{string}]
1204 @opindex -S
1205 @opindex --sep-string
1206 Use @var{string} to separate output columns.  The @samp{-S} option doesn't
1207 affect the @samp{-W/-w} option, unlike the @samp{-s} option which does.  It
1208 does not affect line truncation or column alignment.
1209 Without @samp{-S}, and with @samp{-J}, @code{pr} uses the default output
1210 separator, TAB.
1211 Without @samp{-S} or @samp{-J}, @code{pr} uses a @samp{space}
1212 (same as @samp{-S" "}).
1213 Using @samp{-S} with no @var{string} is equivalent to @samp{-S""}.
1214 Note that for some of @code{pr}'s options the single-letter option
1215 character must be followed immediately by any corresponding argument;
1216 there may not be any intervening white space.
@samp{-S} and @samp{-s} are among them.  Don't use @samp{-S "STRING"}.
1218 @sc{posix} requires this.
1219
1220 @item -t
1221 @itemx --omit-header
1222 @opindex -t
1223 @opindex --omit-header
1224 Do not print the usual header [and footer] on each page, and do not fill
1225 out the bottom of pages (with blank lines or a form feed).  No page
1226 structure is produced, but form feeds set in the input files are retained.
1227 The predefined pagination is not changed.  @samp{-t} or @samp{-T} may be
1228 useful together with other options; e.g.: @samp{-t -e4}, expand TAB characters
1229 in the input file to 4 spaces but don't make any other changes.  Use of
1230 @samp{-t} overrides @samp{-h}.
1231
1232 @item -T
1233 @itemx --omit-pagination
1234 @opindex -T
1235 @opindex --omit-pagination
1236 Do not print header [and footer].  In addition eliminate all form feeds
1237 set in the input files.
1238
1239 @item -v
1240 @itemx --show-nonprinting
1241 @opindex -v
1242 @opindex --show-nonprinting
1243 Print unprintable characters in octal backslash notation.
1244
1245 @item -w @var{page_width}
1246 @itemx --width=@var{page_width}
1247 @opindex -w
1248 @opindex --width
Set page width to @var{page_width} characters for multiple text-column
output only (default for @var{page_width} is 72).  @samp{-s[@var{char}]} turns
off the default page width and any line truncation and column alignment.
Lines of full length are merged, regardless of the column options
set.  No @var{page_width} setting is possible with single-column output.
This is a @sc{posix}-compliant formulation.
1255
1256 @item -W @var{page_width}
1257 @itemx --page_width=@var{page_width}
1258 @opindex -W
1259 @opindex --page_width
Set the page width to @var{page_width} characters.  This is valid with and
without a column option.  Text lines are truncated, unless @samp{-J}
is used.  Together with one of the three column options
(@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}), column
alignment is always used.  The separator options @samp{-S} and @samp{-s}
don't affect the @samp{-W} option.  The default is 72 characters.  Without
@samp{-W @var{page_width}} and without any of the column options, no line
truncation is used (defined to keep downward compatibility and to meet
most frequent tasks).  That's equivalent to @samp{-W 72 -J}.  With and
without @samp{-W @var{page_width}}, the header line is always truncated
to avoid line overflow.
1271
1272 @end table
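
A few of these options might be combined as follows.  This is a
sketch with arbitrary sample input; column spacing in the multicolumn
cases depends on the page width in effect.

@example
# Expand input tabs to 4-column tab stops, with no header or footer.
printf 'a\tb\n' | pr -t -e4
# Print six lines in two columns, filled down.
printf '1\n2\n3\n4\n5\n6\n' | pr -2 -t
# The same input, filled across.
printf '1\n2\n3\n4\n5\n6\n' | pr -2 -t -a
@end example

Both two-column commands should produce three output lines; they differ
only in whether consecutive input lines run down a column or across a
row.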
1273
1274
1275 @node fold invocation
1276 @section @code{fold}: Wrap input lines to fit in specified width
1277
1278 @pindex fold
1279 @cindex wrapping long input lines
1280 @cindex folding long input lines
1281
1282 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1283 standard input if none are given, to standard output, breaking long
1284 lines.  Synopsis:
1285
1286 @example
1287 fold [@var{option}]@dots{} [@var{file}]@dots{}
1288 @end example
1289
1290 By default, @code{fold} breaks lines wider than 80 columns.  The output
1291 is split into as many lines as necessary.
1292
1293 @cindex screen columns
1294 @code{fold} counts screen columns by default; thus, a tab may count more
1295 than one column, backspace decreases the column count, and carriage
1296 return sets the column to zero.
1297
1298 The program accepts the following options.  Also see @ref{Common options}.
1299
1300 @table @samp
1301
1302 @item -b
1303 @itemx --bytes
1304 @opindex -b
1305 @opindex --bytes
1306 Count bytes rather than columns, so that tabs, backspaces, and carriage
1307 returns are each counted as taking up one column, just like other
1308 characters.
1309
1310 @item -s
1311 @itemx --spaces
1312 @opindex -s
1313 @opindex --spaces
1314 Break at word boundaries: the line is broken after the last blank before
1315 the maximum line length.  If the line contains no such blanks, the line
1316 is broken at the maximum line length as usual.
1317
1318 @item -w @var{width}
1319 @itemx --width=@var{width}
1320 @opindex -w
1321 @opindex --width
1322 Use a maximum line length of @var{width} columns instead of 80.
1323
1324 @end table
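
The difference between plain folding and @samp{-s} can be seen with a
short sketch; the sample text and widths are arbitrary.

@example
# Break at exactly 4 columns.
printf 'abcdefghij\n' | fold -w 4
# Prefer breaking after the last blank within 12 columns.
printf 'the quick brown\n' | fold -s -w 12
@end example

The first command should print @samp{abcd}, @samp{efgh} and @samp{ij} on
separate lines.  The second should keep @samp{brown} whole on its own
line, where folding without @samp{-s} would split it after @samp{br}.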
1325
1326
1327 @node Output of parts of files
1328 @chapter Output of parts of files
1329
1330 @cindex output of parts of files
1331 @cindex parts of files, output of
1332
1333 These commands output pieces of the input.
1334
1335 @menu
1336 * head invocation::             Output the first part of files.
1337 * tail invocation::             Output the last part of files.
1338 * split invocation::            Split a file into fixed-size pieces.
1339 * csplit invocation::           Split a file into context-determined pieces.
1340 @end menu
1341
1342 @node head invocation
1343 @section @code{head}: Output the first part of files
1344
1345 @pindex head
1346 @cindex initial part of files, outputting
1347 @cindex first part of files, outputting
1348
1349 @code{head} prints the first part (10 lines by default) of each
1350 @var{file}; it reads from standard input if no files are given or
1351 when given a @var{file} of @samp{-}.  Synopses:
1352
1353 @example
1354 head [@var{option}]@dots{} [@var{file}]@dots{}
1355 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1356 @end example
1357
1358 If more than one @var{file} is specified, @code{head} prints a
1359 one-line header consisting of
1360 @example
1361 ==> @var{file name} <==
1362 @end example
1363 @noindent
1364 before the output for each @var{file}.
1365
1366 @code{head} accepts two option formats: the new one, in which numbers
1367 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1368 the number precedes any option letters (@samp{-1q}).
1369
1370 The program accepts the following options.  Also see @ref{Common options}.
1371
1372 @table @samp
1373
1374 @item -@var{count}@var{options}
1375 @opindex -@var{count}
1376 This option is only recognized if it is specified first.  @var{count} is
1377 a decimal number optionally followed by a size letter (@samp{b},
1378 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1379 or other option letters (@samp{cqv}).
1380
1381 @item -c @var{bytes}
1382 @itemx --bytes=@var{bytes}
1383 @opindex -c
1384 @opindex --bytes
1385 Print the first @var{bytes} bytes, instead of initial lines.  Appending
1386 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1387 by 1048576.
1388
@item -n @var{n}
1390 @itemx --lines=@var{n}
1391 @opindex -n
1392 @opindex --lines
1393 Output the first @var{n} lines.
1394
1395 @item -q
1396 @itemx --quiet
1397 @itemx --silent
1398 @opindex -q
1399 @opindex --quiet
1400 @opindex --silent
1401 Never print file name headers.
1402
1403 @item -v
1404 @itemx --verbose
1405 @opindex -v
1406 @opindex --verbose
1407 Always print file name headers.
1408
1409 @end table
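
The following sketch tries the options above on small inputs.  The
temporary directory and the file names @file{one} and @file{two} are
made up for the illustration.

@example
# First two lines.
printf '1\n2\n3\n' | head -n 2
# First three bytes.
printf 'abcdef' | head -c 3
# With two files, each piece of output gets a header.
tmp=$(mktemp -d) && cd "$tmp"
printf 'a\n' > one
printf 'b\n' > two
head -n 1 one two
# The same, without headers.
head -q -n 1 one two
@end example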
1410
1411
1412 @node tail invocation
1413 @section @code{tail}: Output the last part of files
1414
1415 @pindex tail
1416 @cindex last part of files, outputting
1417
1418 @code{tail} prints the last part (10 lines by default) of each
1419 @var{file}; it reads from standard input if no files are given or
1420 when given a @var{file} of @samp{-}.  Synopses:
1421
1422 @example
1423 tail [@var{option}]@dots{} [@var{file}]@dots{}
1424 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1425 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1426 @end example
1427
1428 If more than one @var{file} is specified, @code{tail} prints a
1429 one-line header consisting of
1430 @example
1431 ==> @var{file name} <==
1432 @end example
1433 @noindent
1434 before the output for each @var{file}.
1435
1436 @cindex BSD @code{tail}
1437 GNU @code{tail} can output any amount of data (some other versions of
1438 @code{tail} cannot).  It also has no @samp{-r} option (print in
1439 reverse), since reversing a file is really a different job from printing
1440 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1441 only reverse files that are at most as large as its buffer, which is
1442 typically 32k.  A more reliable and versatile way to reverse files is
1443 the GNU @code{tac} command.
1444
1445 @code{tail} accepts two option formats: the new one, in which numbers
1446 are arguments to the options (@samp{-n 1}), and the old one, in which
1447 the number precedes any option letters (@samp{-1} or @samp{+1}).
1448
1449 If any option-argument is a number @var{n} starting with a @samp{+},
1450 @code{tail} begins printing with the @var{n}th item from the start of
1451 each file, instead of from the end.
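
For instance, the two ways of counting might be compared like this (a
sketch on a four-line sample input):

@example
# Last two lines.
printf '1\n2\n3\n4\n' | tail -n 2
# Everything from the third line onward.
printf '1\n2\n3\n4\n' | tail -n +3
@end example

Both commands print @samp{3} and @samp{4} here, because the third line
from the start of this input is also the second line from the end.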
1452
1453 The program accepts the following options.  Also see @ref{Common options}.
1454
1455 @table @samp
1456
1457 @item -@var{count}
1458 @itemx +@var{count}
1459 @opindex -@var{count}
1460 @opindex +@var{count}
1461 This option is only recognized if it is specified first.  @var{count} is
1462 a decimal number optionally followed by a size letter (@samp{b},
1463 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1464 or other option letters (@samp{cfqv}).
1465
1466 @item -c @var{bytes}
1467 @itemx --bytes=@var{bytes}
1468 @opindex -c
1469 @opindex --bytes
1470 Output the last @var{bytes} bytes, instead of final lines.  Appending
1471 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1472 by 1048576.
1473
1474 @item -f
1475 @itemx --follow[=@var{how}]
1476 @opindex -f
1477 @opindex --follow
1478 @cindex growing files
1479 @vindex name @r{follow option}
1480 @vindex descriptor @r{follow option}
1481 Loop forever trying to read more characters at the end of the file,
1482 presumably because the file is growing.  This option is ignored when
1483 reading from a pipe.
1484 If more than one file is given, @code{tail} prints a header whenever it
1485 gets output from a different file, to indicate which file that output is
1486 from.
1487
1488 There are two ways to specify how you'd like to track files with this option,
1489 but that difference is noticeable only when a followed file is removed or
1490 renamed.
1491 If you'd like to continue to track the end of a growing file even after
1492 it has been unlinked, use @samp{--follow=descriptor}.  This is the default
1493 behavior, but it is not useful if you're tracking a log file that may be
1494 rotated (removed or renamed, then reopened).  In that case, use
1495 @samp{--follow=name} to track the named file by reopening it periodically
1496 to see if it has been removed and recreated by some other program.
1497
1498 No matter which method you use, if the tracked file is determined to have
1499 shrunk, @code{tail} prints a message saying the file has been truncated
1500 and resumes tracking the end of the file from the newly-determined endpoint.
1501
When a file is removed, @code{tail}'s behavior depends on whether it is
following the name or the descriptor.  When following by name, @code{tail} can
detect that a file has been removed and gives a message to that effect,
and if @samp{--retry} has been specified it will continue checking
periodically to see if the file reappears.
When following a descriptor, @code{tail} does not detect that the file has
been unlinked or renamed and issues no message; even though the file
may no longer be accessible via its original name, it may still be
growing.
1511
1512 The option values @samp{descriptor} and @samp{name} may be specified only
1513 with the long form of the option, not with @samp{-f}.
1514
@item --retry
@opindex --retry
This option is meaningful only when following by name.
Without this option, when @code{tail} encounters a file that doesn't
exist or is otherwise inaccessible, it reports that fact and
never checks it again.  With this option, it continues checking
periodically to see whether the file appears.
1521
@item --sleep-interval=@var{n}
1523 @opindex --sleep-interval
1524 Change the number of seconds to wait between iterations (the default is 1).
1525 During one iteration, every specified file is checked to see if it has
1526 changed size.
1527
@item --pid=@var{pid}
@opindex --pid
When following by name or by descriptor, you may specify the process ID,
@var{pid}, of the sole writer of all @var{file} arguments.  Then, shortly
after that process terminates, @code{tail} will also terminate.  This will
work properly only if the writer and the tailing process are running on
the same machine.  For example, to save the output of a build in a file
and to watch the file grow, invoke @code{make} and @code{tail} as shown
below; the @code{tail} process will then stop when your build completes.
Without this option, you would have had to kill the @code{tail -f}
process yourself.
1539 @example
1540 $ make >& makerr & tail --pid=$! -f makerr
1541 @end example
1542 If you specify a @var{pid} that is not in use or that does not correspond
1543 to the process that is writing to the tailed files, then @code{tail}
1544 may terminate long before any @var{file}s stop growing or it may not
1545 terminate until long after the real writer has terminated.
1546
@item --max-consecutive-size-changes=@var{n}
@opindex --max-consecutive-size-changes
This option is meaningful only when following by name.
Use it to control how long @code{tail} follows the descriptor of a file
that continues growing at a rapid pace even after it is deleted or renamed.
After detecting @var{n} consecutive size changes for a file, @code{tail}
will @code{open}/@code{fstat} the file to determine if that file name is
still associated with the same device/inode-number pair as before.
See the output of @code{tail --help} for the default value.
1556
@item --max-unchanged-stats=@var{n}
@opindex --max-unchanged-stats
When tailing a file by name, if there have been @var{n} consecutive
iterations for which the size has remained the same, then @code{tail}
will @code{open}/@code{fstat} the file to determine if that file name is
still associated with the same device/inode-number pair as before.
When following a log file that is rotated, this is approximately the
number of seconds between when @code{tail} prints the last pre-rotation lines
and when it prints the lines that have accumulated in the new log file.
See the output of @code{tail --help} for the default value.
This option is meaningful only when following by name.
1568
@item -n @var{n}
1570 @itemx --lines=@var{n}
1571 @opindex -n
1572 @opindex --lines
1573 Output the last @var{n} lines.
1574
1575 @item -q
@itemx --quiet
1577 @itemx --silent
1578 @opindex -q
1579 @opindex --quiet
1580 @opindex --silent
1581 Never print file name headers.
1582
1583 @item -v
1584 @itemx --verbose
1585 @opindex -v
1586 @opindex --verbose
1587 Always print file name headers.
1588
1589 @end table
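
Byte counts work like line counts, including the leading @samp{+}
form.  A minimal sketch on a six-byte sample input:

@example
# Last two bytes.
printf 'abcdef' | tail -c 2
# Everything from the third byte onward.
printf 'abcdef' | tail -c +3
@end example

The first command prints @samp{ef} and the second prints @samp{cdef}.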
1590
1591
1592 @node split invocation
1593 @section @code{split}: Split a file into fixed-size pieces
1594
1595 @pindex split
1596 @cindex splitting a file into pieces
1597 @cindex pieces, splitting a file into
1598
1599 @code{split} creates output files containing consecutive sections of
1600 @var{input} (standard input if none is given or @var{input} is
1601 @samp{-}).  Synopsis:
1602
1603 @example
1604 split [@var{option}] [@var{input} [@var{prefix}]]
1605 @end example
1606
By default, @code{split} puts 1000 lines of @var{input} (or whatever is
left over for the last section) into each output file.
1609
1610 @cindex output file name prefix
1611 The output files' names consist of @var{prefix} (@samp{x} by default)
1612 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1613 that concatenating the output files in sorted order by file name produces
1614 the original input file.  (If more than 676 output files are required,
1615 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
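
The naming scheme can be tried on a small input.  In this sketch the
temporary directory and the prefix @samp{part.} are arbitrary choices,
and @samp{-l 2} is used only to keep the pieces small.

@example
tmp=$(mktemp -d) && cd "$tmp"
# Five lines, two per piece, with the prefix part.
printf '1\n2\n3\n4\n5\n' | split -l 2 - part.
ls
# Concatenating the pieces in sorted order restores the input.
cat part.aa part.ab part.ac
@end example

The figure of 676 is the number of two-letter suffixes, 26 times 26.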
1616
1617 The program accepts the following options.  Also see @ref{Common options}.
1618
1619 @table @samp
1620
1621 @item -@var{lines}
1622 @itemx -l @var{lines}
1623 @itemx --lines=@var{lines}
1624 @opindex -l
1625 @opindex --lines
1626 Put @var{lines} lines of @var{input} into each output file.
1627
1628 @item -b @var{bytes}
1629 @itemx --bytes=@var{bytes}
1630 @opindex -b
1631 @opindex --bytes
Put @var{bytes} bytes of @var{input} into each output file.
1633 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1634 @samp{m} by 1048576.
1635
1636 @item -C @var{bytes}
1637 @itemx --line-bytes=@var{bytes}
1638 @opindex -C
1639 @opindex --line-bytes
1640 Put into each output file as many complete lines of @var{input} as
1641 possible without exceeding @var{bytes} bytes.  For lines longer than
1642 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1643 less than @var{bytes} bytes of the line are left, then continue
1644 normally.  @var{bytes} has the same format as for the @samp{--bytes}
1645 option.
1646
1647 @item --verbose
1648 @opindex --verbose
1649 Write a diagnostic to standard error just before each output file is opened.
1650
1651 @end table
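
As a rough illustration of the naming scheme and of the options above, the
following shell sketch splits a small file and then reassembles it.  The
names @file{input.txt}, @file{rejoined.txt}, and the prefix @samp{part.}
are hypothetical, chosen only for this example.

@example
# Work in a scratch directory so the pieces are easy to inspect.
tmp=$(mktemp -d) && cd "$tmp"
printf '1\n2\n3\n4\n5\n' > input.txt

# Two lines per output file: part.aa, part.ab, and part.ac.
split -l 2 input.txt part.

# The last piece holds whatever is left over (here, one line).
cat part.ac

# Concatenating the pieces in sorted name order restores the input.
cat part.* > rejoined.txt
cmp input.txt rejoined.txt && echo identical
@end example

Using @samp{-b} or @samp{-C} in place of @samp{-l} should produce the same
file names; only the rule for where each piece ends changes.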
1652
1653
1654 @node csplit invocation
1655 @section @code{csplit}: Split a file into context-determined pieces
1656
1657 @pindex csplit
1658 @cindex context splitting
1659 @cindex splitting a file into pieces by context
1660
1661 @code{csplit} creates zero or more output files containing sections of
1662 @var{input} (standard input if @var{input} is @samp{-}).  Synopsis:
1663
1664 @example
1665 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1666 @end example
1667
1668 The contents of the output files are determined by the @var{pattern}
1669 arguments, as detailed below.  An error occurs if a @var{pattern}
1670 argument refers to a nonexistent line of the input file (e.g., if no
1671 remaining line matches a given regular expression).  After every
1672 @var{pattern} has been matched, any remaining input is copied into one
1673 last output file.
1674
1675 By default, @code{csplit} prints the number of bytes written to each
1676 output file after it has been created.
1677
1678 The types of pattern arguments are:
1679
1680 @table @samp
1681
1682 @item @var{n}
1683 Create an output file containing the input up to but not including line
1684 @var{n} (a positive integer).  If followed by a repeat count, also
1685 create an output file containing the next @var{n} lines of the input
1686 file once for each repeat.
1687
1688 @item /@var{regexp}/[@var{offset}]
1689 Create an output file containing the current line up to (but not
1690 including) the next line of the input file that contains a match for
1691 @var{regexp}.  The optional @var{offset} is a @samp{+} or @samp{-}
1692 followed by a positive integer.  If it is given, the input up to the
1693 matching line plus or minus @var{offset} is put into the output file,
1694 and the line after that begins the next section of input.
1695
1696 @item %@var{regexp}%[@var{offset}]
1697 Like the previous type, except that it does not create an output
1698 file, so that section of the input file is effectively ignored.
1699
1700 @item @{@var{repeat-count}@}
1701 Repeat the previous pattern @var{repeat-count} additional
1702 times.  @var{repeat-count} can either be a positive integer or an
1703 asterisk, meaning repeat as many times as necessary until the input is
1704 exhausted.
1705
1706 @end table
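
The pattern types can be tried on a small file.  This is only a sketch: the
file name @file{input.txt} and the prefix @samp{skip} are hypothetical, and
@samp{-s} is used merely to suppress the byte counts.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'A1\nA2\nB1\nB2\nC1\n' > input.txt

# Two regexp patterns give three files under the default names:
# one section before each match, plus the remaining input.
csplit -s input.txt '/^B/' '/^C/'
cat xx00    # A1 and A2
cat xx01    # B1 and B2
cat xx02    # C1

# A %regexp% pattern skips input without creating a file for it,
# so the first file written starts at the matching line.
csplit -s -f skip input.txt '%^B%'
cat skip00  # B1, B2, and C1
@end example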
1707
1708 The output files' names consist of a prefix (@samp{xx} by default)
1709 followed by a suffix.  By default, the suffix is an ascending sequence
1710 of two-digit decimal numbers from @samp{00} up to @samp{99}.  In any
1711 case, concatenating the output files in sorted order by filename
1712 produces the original input file.
1713
1714 By default, if @code{csplit} encounters an error or receives a hangup,
1715 interrupt, quit, or terminate signal, it removes any output files
1716 that it has created so far before it exits.
1717
1718 The program accepts the following options.  Also see @ref{Common options}.
1719
1720 @table @samp
1721
1722 @item -f @var{prefix}
1723 @itemx --prefix=@var{prefix}
1724 @opindex -f
1725 @opindex --prefix
1726 @cindex output file name prefix
1727 Use @var{prefix} as the output file name prefix.
1728
1729 @item -b @var{suffix}
1730 @itemx --suffix=@var{suffix}
1731 @opindex -b
1732 @opindex --suffix
1733 @cindex output file name suffix
1734 Use @var{suffix} as the output file name suffix.  When this option is
1735 specified, the suffix string must include exactly one
1736 @code{printf(3)}-style conversion specification, possibly including
1737 format specification flags, a field width, a precision specification,
1738 or all of these kinds of modifiers.  The format letter must convert a
1739 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1740 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed.  The
1741 entire @var{suffix} is given (with the current output file number) to
1742 @code{sprintf(3)} to form the file name suffixes for each of the
1743 individual output files in turn.  If this option is used, the
1744 @samp{--digits} option is ignored.
1745
1746 @item -n @var{digits}
1747 @itemx --digits=@var{digits}
1748 @opindex -n
1749 @opindex --digits
1750 Use output file names containing numbers that are @var{digits} digits
1751 long instead of the default 2.
1752
1753 @item -k
1754 @itemx --keep-files
1755 @opindex -k
1756 @opindex --keep-files
1757 Do not remove output files when errors are encountered.
1758
1759 @item -z
1760 @itemx --elide-empty-files
1761 @opindex -z
1762 @opindex --elide-empty-files
1763 Suppress the generation of zero-length output files.  (In cases where
1764 the section delimiters of the input file are supposed to mark the first
1765 lines of each of the sections, the first output file will generally be a
1766 zero-length file unless you use this option.)  The output file sequence
1767 numbers always run consecutively starting from 0, even when this option
1768 is specified.
1769
1770 @item -s
1771 @itemx -q
1772 @itemx --silent
1773 @itemx --quiet
1774 @opindex -s
1775 @opindex -q
1776 @opindex --silent
1777 @opindex --quiet
1778 Do not print counts of output file sizes.
1779
1780 @end table
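
The naming options can be combined as in the following sketch.  The prefix
@samp{part-}, the suffix format @samp{%03d.txt}, and the file name
@file{input.txt} are hypothetical values picked for illustration.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'one\ntwo\nthree\nfour\n' > input.txt

# Split before line 3, naming the pieces part-000.txt, part-001.txt.
csplit -s -f part- -b '%03d.txt' input.txt 3
cat part-000.txt   # one and two
cat part-001.txt   # three and four

# With -n alone, only the width of the numeric suffix changes.
csplit -s -n 4 input.txt 3
ls xx0000 xx0001
@end example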
1781
1782
1783 @node Summarizing files
1784 @chapter Summarizing files
1785
1786 @cindex summarizing files
1787
1788 These commands generate just a few numbers representing the entire
1789 contents of files.
1790
1791 @menu
1792 * wc invocation::               Print byte, word, and line counts.
1793 * sum invocation::              Print checksum and block counts.
1794 * cksum invocation::            Print CRC checksum and byte counts.
1795 * md5sum invocation::           Print or check message-digests.
1796 @end menu
1797
1798
1799 @node wc invocation
1800 @section @code{wc}: Print byte, word, and line counts
1801
1802 @pindex wc
1803 @cindex byte count
1804 @cindex word count
1805 @cindex line count
1806
1807 @code{wc} counts the number of bytes, whitespace-separated words, and
1808 newlines in each given @var{file}, or standard input if none are given
1809 or for a @var{file} of @samp{-}.  Synopsis:
1810
1811 @example
1812 wc [@var{option}]@dots{} [@var{file}]@dots{}
1813 @end example
1814
1815 @cindex total counts
1816 @code{wc} prints one line of counts for each file, and if the file was
1817 given as an argument, it prints the file name following the counts.  If
1818 more than one @var{file} is given, @code{wc} prints a final line
1819 containing the cumulative counts, with the file name @file{total}.  The
1820 counts are printed in this order: newlines, words, bytes.
1821 By default, each count is output right-justified in a 7-byte field with
1822 one space between fields so that the numbers and file names line up nicely
1823 in columns.  However, POSIX requires that there be exactly one space
1824 separating columns.  You can make @code{wc} use the POSIX-mandated
1825 output format by setting the @env{POSIXLY_CORRECT} environment variable.
1826
1827 By default, @code{wc} prints all three counts.  Options can specify
1828 that only certain counts be printed.  Options do not undo others
1829 previously given, so
1830
1831 @example
1832 wc --bytes --words
1833 @end example
1834
1835 @noindent
1836 prints both the byte counts and the word counts.
1837
1838 With the @code{--max-line-length} option, @code{wc} prints the length
1839 of the longest line per file, and if there is more than one file it
1840 prints the maximum (not the sum) of those lengths.
1841
1842 The program accepts the following options.  Also see @ref{Common options}.
1843
1844 @table @samp
1845
1846 @item -c
1847 @itemx --bytes
1848 @itemx --chars
1849 @opindex -c
1850 @opindex --bytes
1851 @opindex --chars
1852 Print only the byte counts.
1853
1854 @item -w
1855 @itemx --words
1856 @opindex -w
1857 @opindex --words
1858 Print only the word counts.
1859
1860 @item -l
1861 @itemx --lines
1862 @opindex -l
1863 @opindex --lines
1864 Print only the newline counts.
1865
1866 @item -L
1867 @itemx --max-line-length
1868 @opindex -L
1869 @opindex --max-line-length
1870 Print only the maximum line lengths.
1871
1872 @end table
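
A short sketch of these options, using two hypothetical files
@file{a.txt} and @file{b.txt}, may help.  The counts in the comments follow
from the file contents created here.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'one two\nthree\n' > a.txt
printf 'a considerably longer line\n' > b.txt

# All three counts for a.txt: 2 newlines, 3 words, 14 bytes.
wc a.txt

# Options accumulate, and the output order stays fixed:
# the word count is printed before the byte count.
wc --bytes --words a.txt

# Reading standard input prints the count without a file name.
wc -l < a.txt

# Longest line per file; the final line shows the maximum, not a sum.
wc -L a.txt b.txt
@end example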
1873
1874
1875 @node sum invocation
1876 @section @code{sum}: Print checksum and block counts
1877
1878 @pindex sum
1879 @cindex 16-bit checksum
1880 @cindex checksum, 16-bit
1881
1882 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1883 standard input if none are given or for a @var{file} of @samp{-}.  Synopsis:
1884
1885 @example
1886 sum [@var{option}]@dots{} [@var{file}]@dots{}
1887 @end example
1888
1889 @code{sum} prints the checksum for each @var{file} followed by the
1890 number of blocks in the file (rounded up).  If more than one @var{file}
1891 is given, file names are also printed (by default).  (With the
1892 @samp{--sysv} option, the corresponding file names are printed when there is
1893 at least one file argument.)
1894
1895 By default, GNU @code{sum} computes checksums using an algorithm
1896 compatible with BSD @code{sum} and prints file sizes in units of
1897 1024-byte blocks.
1898
1899 The program accepts the following options.  Also see @ref{Common options}.
1900
1901 @table @samp
1902
1903 @item -r
1904 @opindex -r
1905 @cindex BSD @code{sum}
1906 Use the default (BSD compatible) algorithm.  This option is included for
1907 compatibility with the System V @code{sum}.  Unless @samp{-s} was also
1908 given, it has no effect.
1909
1910 @item -s
1911 @itemx --sysv
1912 @opindex -s
1913 @opindex --sysv
1914 @cindex System V @code{sum}
1915 Compute checksums using an algorithm compatible with System V
1916 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1917
1918 @end table
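
The difference in block units can be seen with a file of exactly 1024
bytes.  This is a minimal sketch; @file{zeros.bin} is a hypothetical name
and the file is filled with zero bytes only for convenience.

@example
tmp=$(mktemp -d) && cd "$tmp"
dd if=/dev/zero of=zeros.bin bs=1024 count=1 2>/dev/null

# Default (BSD-compatible): 1024-byte blocks, so one block.
sum < zeros.bin

# System V-compatible: 512-byte blocks, so two blocks.
sum -s < zeros.bin
@end example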
1919
1920 @code{sum} is provided for compatibility; the @code{cksum} program (see
1921 next section) is preferable in new applications.
1922
1923
1924 @node cksum invocation
1925 @section @code{cksum}: Print CRC checksum and byte counts
1926
1927 @pindex cksum
1928 @cindex cyclic redundancy check
1929 @cindex CRC checksum
1930
1931 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1932 given @var{file}, or standard input if none are given or for a
1933 @var{file} of @samp{-}.  Synopsis:
1934
1935 @example
1936 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1937 @end example
1938
1939 @code{cksum} prints the CRC checksum for each file along with the number
1940 of bytes in the file, and the filename unless no arguments were given.
1941
1942 @code{cksum} is typically used to ensure that files
1943 transferred by unreliable means (e.g., netnews) have not been corrupted,
1944 by comparing the @code{cksum} output for the received files with the
1945 @code{cksum} output for the original files (typically given in the
1946 distribution).
1947
1948 The CRC algorithm is specified by the @sc{posix.2} standard.  It is not
1949 compatible with the BSD or System V @code{sum} algorithms (see the
1950 previous section); it is more robust.
1951
1952 The only options are @samp{--help} and @samp{--version}.  @xref{Common
1953 options}.
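
The comparison described above might be scripted roughly as follows.  The
file names @file{original.dat} and @file{received.dat} are hypothetical,
and @code{cp} merely stands in for whatever transfer is being checked.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'payload\n' > original.dat
cp original.dat received.dat

# Reading from standard input keeps file names out of the output,
# so the two results can be compared as plain strings.
if [ "$(cksum < original.dat)" = "$(cksum < received.dat)" ]; then
    echo 'checksums match'
else
    echo 'transfer corrupted'
fi
@end example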
1954
1955
1956 @node md5sum invocation
1957 @section @code{md5sum}: Print or check message-digests
1958
1959 @pindex md5sum
1960 @cindex 128-bit checksum
1961 @cindex checksum, 128-bit
1962 @cindex fingerprint, 128-bit
1963 @cindex message-digest, 128-bit
1964
1965 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1966 @dfn{message-digest}) for each specified @var{file}.
1967 If a @var{file} is specified as @samp{-} or if no files are given,
1968 @code{md5sum} computes the checksum for the standard input.
1969 @code{md5sum} can also determine whether a file and checksum are
1970 consistent.  Synopses:
1971
1972 @example
1973 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1974 md5sum [@var{option}]@dots{} --check [@var{file}]
1975 @end example
1976
1977 For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
1978 indicating a binary or text input file, and the filename.
1979 If @var{file} is omitted or specified as @samp{-}, standard input is read.
1980
1981 The program accepts the following options.  Also see @ref{Common options}.
1982
1983 @table @samp
1984
1985 @item -b
1986 @itemx --binary
1987 @opindex -b
1988 @opindex --binary
1989 @cindex binary input files
1990 Treat all input files as binary.  This option has no effect on Unix
1991 systems, since they don't distinguish between binary and text files.
1992 This option is useful on systems that have different internal and
1993 external character representations.  On MS-DOS and MS-Windows, this is
1994 the default.
1995
1996 @item -c
1997 @itemx --check
1998 Read filenames and checksum information from the single @var{file}
1999 (or from stdin if no @var{file} was specified) and report whether
2000 each named file and the corresponding checksum data are consistent.
2001 The input to this mode of @code{md5sum} is usually the output of
2002 a prior, checksum-generating run of @code{md5sum}.
2003 Each valid line of input consists of an MD5 checksum, a binary/text
2004 flag, and then a filename.
2005 Binary files are marked with @samp{*}, text with @samp{ }.
2006 For each such line, @code{md5sum} reads the named file and computes its
2007 MD5 checksum.  Then, if the computed message digest does not match the
2008 one on the line with the filename, the file is noted as having
2009 failed the test.  Otherwise, the file passes the test.
2010 By default, for each valid line, one line is written to standard
2011 output indicating whether the named file passed the test.
2012 After all checks have been performed, if there were any failures,
2013 a warning is issued to standard error.
2014 Use the @samp{--status} option to inhibit that output.
2015 If any listed file cannot be opened or read, if any valid line has
2016 an MD5 checksum inconsistent with the associated file, or if no valid
2017 line is found, @code{md5sum} exits with nonzero status.  Otherwise,
2018 it exits successfully.
2019
2020 @item --status
2021 @opindex --status
2022 @cindex verifying MD5 checksums
2023 This option is useful only when verifying checksums.
2024 When verifying checksums, don't generate the default one-line-per-file
2025 diagnostic and don't output the warning summarizing any failures.
2026 Failures to open or read a file still evoke individual diagnostics to
2027 standard error.
2028 If all listed files are readable and are consistent with the associated
2029 MD5 checksums, exit successfully.  Otherwise exit with a status code
2030 indicating there was a failure.
2031
2032 @item -t
2033 @itemx --text
2034 @opindex -t
2035 @opindex --text
2036 @cindex text input files
2037 Treat all input files as text files.  This is the reverse of
2038 @samp{--binary}.
2039
2040 @item -w
2041 @itemx --warn
2042 @opindex -w
2043 @opindex --warn
2044 @cindex verifying MD5 checksums
2045 When verifying checksums, warn about improperly formatted MD5 checksum lines.
2046 This option is useful only if all but a few lines in the checked input
2047 are valid.
2048
2049 @end table
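
A typical generate-then-verify cycle might look like the sketch below.  The
names @file{a.txt}, @file{b.txt}, and @file{MD5SUMS} are hypothetical.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'alpha\n' > a.txt
printf 'beta\n'  > b.txt

# Record one checksum line per file, then verify them.
md5sum a.txt b.txt > MD5SUMS
md5sum --check MD5SUMS

# After a file changes, --status gives a silent pass/fail result
# that a script can test through the exit status.
printf 'tampered\n' >> b.txt
if md5sum --check --status MD5SUMS; then
    echo 'all files verified'
else
    echo 'verification failed'
fi
@end example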
2050
2051
2052 @node Operating on sorted files
2053 @chapter Operating on sorted files
2054
2055 @cindex operating on sorted files
2056 @cindex sorted files, operations on
2057
2058 These commands work with (or produce) sorted files.
2059
2060 @menu
2061 * sort invocation::             Sort text files.
2062 * uniq invocation::             Uniquify files.
2063 * comm invocation::             Compare two sorted files line by line.
2064 * ptx invocation::              Produce a permuted index of file contents.
2065 * tsort invocation::            Topological sort.
2066 @end menu
2067
2068
2069 @node sort invocation
2070 @section @code{sort}: Sort text files
2071
2072 @pindex sort
2073 @cindex sorting files
2074
2075 @code{sort} sorts, merges, or compares all the lines from the given
2076 files, or standard input if none are given or for a @var{file} of
2077 @samp{-}.  By default, @code{sort} writes the results to standard
2078 output.  Synopsis:
2079
2080 @example
2081 sort [@var{option}]@dots{} [@var{file}]@dots{}
2082 @end example
2083
2084 @code{sort} has three modes of operation: sort (the default), merge,
2085 and check for sortedness.  The following options change the operation
2086 mode:
2087
2088 @table @samp
2089
2090 @item -c
2091 @opindex -c
2092 @cindex checking for sortedness
2093 Check whether the given files are already sorted: if they are not all
2094 sorted, print an error message and exit with a status of 1.
2095 Otherwise, exit successfully.
2096
2097 @item -m
2098 @opindex -m
2099 @cindex merging sorted files
2100 Merge the given files by sorting them as a group.  Each input file must
2101 always be individually sorted.  Sorting the files instead of merging
2102 them always works; merging is provided because it is faster when the
2103 inputs are already sorted.
2104
2105 @end table
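
The check and merge modes can be exercised as in this sketch.  The file
names are hypothetical, and @samp{LC_ALL=C} is set only so that the
ordering does not depend on the reader's locale.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'apple\ncherry\n' > a.txt
printf 'banana\ndate\n'  > b.txt
printf 'pear\nfig\n'     > unsorted.txt

# Check mode: success for sorted input, a diagnostic and a
# nonzero status otherwise.
LC_ALL=C sort -c a.txt && echo 'a.txt is sorted'
LC_ALL=C sort -c unsorted.txt || echo "not sorted (status $?)"

# Merge mode: both inputs are already sorted individually.
LC_ALL=C sort -m a.txt b.txt
@end example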
2106
2107 @vindex LC_COLLATE
2108 A pair of lines is compared as follows: if any key fields have been
2109 specified, @code{sort} compares each pair of fields, in the order
2110 specified on the command line, according to the associated ordering
2111 options, until a difference is found or no fields are left.
2112 Unless otherwise specified, all comparisons use the character
2113 collating sequence specified by the @env{LC_COLLATE} locale.
2114
2115 If any of the global options @samp{Mbdfinr} are given but no key fields
2116 are specified, @code{sort} compares the entire lines according to the
2117 global options.
2118
2119 Finally, as a last resort when all keys compare equal (or if no
2120 ordering options were specified at all), @code{sort} compares the entire
2121 lines.  The last resort comparison
2122 honors the @samp{-r} global option.  The @samp{-s} (stable) option
2123 disables this last-resort comparison so that lines in which all fields
2124 compare equal are left in their original relative order.  If no fields
2125 or global options are specified, @samp{-s} has no effect.
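
The effect of the last-resort comparison, and of @samp{-s}, can be seen
with two lines whose second fields compare equal.  This is a small sketch
on made-up input; @samp{LC_ALL=C} is an assumption made for repeatability.

@example
# Keys are equal, so whole lines are compared: a 1, then b 1.
printf 'b 1\na 1\n' | LC_ALL=C sort -k 2,2

# With -s the original relative order is kept: b 1, then a 1.
printf 'b 1\na 1\n' | LC_ALL=C sort -s -k 2,2
@end example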
2126
2127 GNU @code{sort} (as specified for all GNU utilities) has no limits on
2128 input line length or restrictions on bytes allowed within lines.  In
2129 addition, if the final byte of an input file is not a newline, GNU
2130 @code{sort} silently supplies one.  A line's trailing newline is part of
2131 the line for comparison purposes; for example, with no options in an
2132 @sc{ascii} locale, a line starting with a tab sorts before an empty line
2133 because tab precedes newline in the @sc{ascii} collating sequence.
2134
2135 Upon any error, @code{sort} exits with a status of @samp{2}.
2136
2137 @vindex TMPDIR
2138 If the environment variable @env{TMPDIR} is set, @code{sort} uses its
2139 value as the directory for temporary files instead of @file{/tmp}.  The
2140 @samp{-T @var{tempdir}} option in turn overrides the environment
2141 variable.
2142
2143 @vindex LC_CTYPE
2144 The following options affect the ordering of output lines.  They may be
2145 specified globally or as part of a specific key field.  If no key
2146 fields are specified, global options apply to comparison of entire
2147 lines; otherwise the global options are inherited by key fields that do
2148 not specify any special options of their own.  The @samp{-b}, @samp{-d},
2149 @samp{-f} and @samp{-i} options classify characters according to
2150 the @env{LC_CTYPE} locale.
2151
2152 @table @samp
2153
2154 @item -b
2155 @opindex -b
2156 @cindex blanks, ignoring leading
2157 Ignore leading blanks when finding sort keys in each line.
2158
2159 @item -d
2160 @opindex -d
2161 @cindex phone directory order
2162 @cindex telephone directory order
2163 Sort in @dfn{phone directory} order: ignore all characters except
2164 letters, digits and blanks when sorting.
2165
2166 @item -f
2167 @opindex -f
2168 @cindex case folding
2169 Fold lowercase characters into the equivalent uppercase characters when
2170 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
2171
2172 @item -g
2173 @opindex -g
2174 @cindex general numeric sort
2175 Sort numerically, using the standard C function @code{strtod} to convert
2176 a prefix of each line to a double-precision floating point number.
2177 This allows floating point numbers to be specified in scientific notation,
2178 like @code{1.0e-34} and @code{10e100}.
2179 Do not report overflow, underflow, or conversion errors.
2180 Use the following collating sequence:
2181
2182 @itemize @bullet
2183 @item
2184 Lines that do not start with numbers (all considered to be equal).
2185 @item
2186 NaNs (``Not a Number'' values, in IEEE floating point arithmetic)
2187 in a consistent but machine-dependent order.
2188 @item
2189 Minus infinity.
2190 @item
2191 Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal).
2192 @item
2193 Plus infinity.
2194 @end itemize
2195
2196 Use this option only if there is no alternative; it is much slower than
2197 @samp{-n} and it can lose information when converting to floating point.
2198
2199 @item -i
2200 @opindex -i
2201 @cindex unprintable characters, ignoring
2202 Ignore unprintable characters.
2203
2204 @item -M
2205 @opindex -M
2206 @cindex months, sorting by
2207 @vindex LC_TIME
2208 An initial string, consisting of any amount of whitespace, followed
2209 by a month name abbreviation, is folded to UPPER case and
2210 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
2211 Invalid names compare low to valid names.  The @env{LC_TIME} locale
2212 determines the month spellings.
2213
2214 @item -n
2215 @opindex -n
2216 @cindex numeric sort
2217 @vindex LC_NUMERIC
2218 Sort numerically: the number begins each line; specifically, it consists
2219 of optional whitespace, an optional @samp{-} sign, and zero or more
2220 digits possibly separated by thousands separators, optionally followed
2221 by a radix character and zero or more digits.  The @env{LC_NUMERIC}
2222 locale specifies the radix character and thousands separator.
2223
2224 @code{sort -n} uses what might be considered an unconventional method
2225 to compare strings representing floating point numbers.  Rather than
2226 first converting each string to the C @code{double} type and then
2227 comparing those values, @code{sort} aligns the radix characters in the two
2228 strings and compares the strings a character at a time.  One benefit
2229 of using this approach is its speed.  In practice this is much more
2230 efficient than performing the two corresponding string-to-double (or even
2231 string-to-integer) conversions and then comparing doubles.  In addition,
2232 there is no corresponding loss of precision.  Converting each string to
2233 @code{double} before comparison would limit precision to about 16 digits
2234 on most systems.
2235
2236 Neither a leading @samp{+} nor exponential notation is recognized.
2237 To compare such strings numerically, use the @samp{-g} option.
2238
2239 @item -r
2240 @opindex -r
2241 @cindex reverse sorting
2242 Reverse the result of comparison, so that lines with greater key values
2243 appear earlier in the output instead of later.
2244
2245 @end table
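
The difference between @samp{-n} and @samp{-g} shows up as soon as the
input contains exponential notation.  The sketch below uses made-up input
and sets @samp{LC_ALL=C} only for repeatability.

@example
# -n does not recognize the exponent, so 1e3 counts as 1:
# the output order is 1e3, 3, 20.
printf '1e3\n20\n3\n' | LC_ALL=C sort -n

# -g converts 1e3 to 1000: the output order is 3, 20, 1e3.
printf '1e3\n20\n3\n' | LC_ALL=C sort -g

# -r reverses the comparison: 20, 3, 1e3.
printf '1e3\n20\n3\n' | LC_ALL=C sort -nr
@end example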
2246
2247 Other options are:
2248
2249 @table @samp
2250
2251 @item -o @var{output-file}
2252 @opindex -o
2253 @cindex overwriting of input, allowed
2254 Write output to @var{output-file} instead of standard output.
2255 If @var{output-file} is one of the input files, @code{sort} copies
2256 it to a temporary file before sorting and writing the output to
2257 @var{output-file}.
2258
2259 @item -t @var{separator}
2260 @opindex -t
2261 @cindex field separator character
2262 Use character @var{separator} as the field separator when finding the
2263 sort keys in each line.  By default, fields are separated by the empty
2264 string between a non-whitespace character and a whitespace character.
2265 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
2266 into fields @w{@samp{ foo}} and @w{@samp{ bar}}.  The field separator is
2267 not considered to be part of either the field preceding or the field
2268 following.
2269
2270 @item -u
2271 @opindex -u
2272 @cindex uniquifying output
2273 For the default case or the @samp{-m} option, only output the first
2274 of a sequence of lines that compare equal.  For the @samp{-c} option,
2275 check that no pair of consecutive lines compares equal.
2276
2277 @item -k @var{pos1}[,@var{pos2}]
2278 @opindex -k
2279 @cindex sort field
2280 The recommended, @sc{posix}, option for specifying a sort field.  The field
2281 consists of the part of the line between @var{pos1} and @var{pos2} (or the
2282 end of the line, if @var{pos2} is omitted), @emph{inclusive}.
2283 Fields and character positions are numbered starting with 1.
2284 So to sort on the second field, you'd use @samp{-k 2,2}.
2285 See below for more examples.
2286
2287 @item -z
2288 @opindex -z
2289 @cindex sort zero-terminated lines
2290 Treat the input as a set of lines, each terminated by a zero byte (@sc{ascii}
2291 @sc{nul} (Null) character) instead of an @sc{ascii} @sc{lf} (Line Feed).
2292 This option can be useful in conjunction with @samp{perl -0} or
2293 @samp{find -print0} and @samp{xargs -0} which do the same in order to
2294 reliably handle arbitrary pathnames (even those which contain Line Feed
2295 characters).
2296
2297 @item +@var{pos1}[-@var{pos2}]
2298 The obsolete, traditional option for specifying a sort field.  The field
2299 consists of the line between @var{pos1} and up to but @emph{not including}
2300 @var{pos2} (or the end of the line if @var{pos2} is omitted).  Fields
2301 and character positions are numbered starting with 0.  See below.
2302
2303 @end table
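
As a brief sketch of @samp{-u}, @samp{-o}, and @samp{-t}, consider the
following commands.  The file name @file{names.txt} and the sample data are
hypothetical.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf 'pear\napple\npear\nfig\n' > names.txt

# Sort in place and drop repeated lines; naming the input file as
# the output file is safe with -o.
LC_ALL=C sort -u -o names.txt names.txt
cat names.txt    # apple, fig, pear

# Use a colon as the field separator and sort on field 2 as a number.
printf 'x:2\ny:1\n' | LC_ALL=C sort -t : -k 2,2n
@end example

By contrast, redirecting with @samp{> names.txt} would normally truncate
the file before @code{sort} gets to read it, which is why @samp{-o} is the
usual choice for in-place sorting.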
2304
2305 In addition, when GNU @code{sort} is invoked with exactly one argument,
2306 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
2307 options}.
2308
2309 Historical (BSD and System V) implementations of @code{sort} have
2310 differed in their interpretation of some options, particularly
2311 @samp{-b}, @samp{-f}, and @samp{-n}.  GNU @code{sort} follows the @sc{posix}
2312 behavior, which is usually (but not always!) like the System V behavior.
2313 According to @sc{posix}, @samp{-n} no longer implies @samp{-b}.  For
2314 consistency, @samp{-M} has been changed in the same way.  This may
2315 affect the meaning of character positions in field specifications in
2316 obscure cases.  The only fix is to add an explicit @samp{-b}.
2317
2318 A position in a sort field specified with the @samp{-k} or @samp{+}
2319 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
2320 of the field to use and @var{c} is the number of the first character
2321 from the beginning of the field (for @samp{+@var{pos}}) or from the end
2322 of the previous field (for @samp{-@var{pos}}).  If the @samp{.@var{c}}
2323 is omitted, it is taken to be the first character in the field.  If the
2324 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
2325 specification is counted from the first nonblank character of the field
2326 (for @samp{+@var{pos}}) or from the first nonblank character following
2327 the previous field (for @samp{-@var{pos}}).
2328
2329 A sort key option may also have any of the option letters @samp{Mbdfinr}
2330 appended to it, in which case the global ordering options are not used
2331 for that particular field.  The @samp{-b} option may be independently
2332 attached to either or both of the @samp{+@var{pos}} and
2333 @samp{-@var{pos}} parts of a field specification, and if it is inherited
2334 from the global options it will be attached to both.
2335 Keys may span multiple fields.
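
For instance, a character offset within a field can be given with the
@samp{-k} form, as in this small sketch on made-up input (with
@samp{LC_ALL=C} assumed for repeatability).

@example
# Sort on the second and third characters of field 1 only.
# The keys are "ba" and "ab", so yab is output before xba.
printf 'xba\nyab\n' | LC_ALL=C sort -k 1.2,1.3
@end example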
2336
2337 Here are some examples to illustrate various combinations of options.
2338 In them, the @sc{posix} @samp{-k} option is used to specify sort keys rather
2339 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
2340
2341 @itemize @bullet
2342
2343 @item
2344 Sort in descending (reverse) numeric order.
2345
2346 @example
2347 sort -nr
2348 @end example
2349 @item
2350 Sort alphabetically, omitting the first and second fields.
2351 This uses a single key composed of the characters beginning
2352 at the start of field three and extending to the end of each line.
2353
2354 @example
2355 sort -k3
2356 @end example
2357
2358 @item
2359 Sort numerically on the second field and resolve ties by sorting
2360 alphabetically on the third and fourth characters of field five.
2361 Use @samp{:} as the field delimiter.
2362
2363 @example
2364 sort -t : -k 2,2n -k 5.3,5.4
2365 @end example
2366
2367 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
2368 @code{sort} would have used all characters beginning in the second field
2369 and extending to the end of the line as the primary @emph{numeric}
2370 key.  For the large majority of applications, treating keys spanning
2371 more than one field as numeric will not do what you expect.
2372
2373 Also note that the @samp{n} modifier was applied to the field-end
2374 specifier for the first key.  It would have been equivalent to
2375 specify @samp{-k 2n,2} or @samp{-k 2n,2n}.  All modifiers except
2376 @samp{b} apply to the associated @emph{field}, regardless of whether
2377 the modifier character is attached to the field-start and/or the
2378 field-end part of the key specifier.
2379
2380 @item
2381 Sort the password file on the fifth field and ignore any
2382 leading white space.  Sort lines with equal values in field five
2383 on the numeric user ID in field three.
2384
2385 @example
2386 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2387 @end example
2388
2389 An alternative is to use the global numeric modifier @samp{-n}.
2390
2391 @example
2392 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
2393 @end example
2394
2395 @item
2396 Generate a tags file in case-insensitive sorted order.
2397 @example
2398 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
2399 @end example
2400
2401 The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
2402 that pathnames that contain Line Feed characters will not get broken up
2403 by the sort operation.
2404
2405 Finally, returning to the password file example: to ignore both leading
2406 and trailing white space in field five, you could apply the @samp{b}
2407 modifier to the field-end specifier for the first key as well,
2408
2409 @example
2410 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
2411 @end example
2412
2413 or you could use the global @samp{-b} modifier instead of @samp{-n}
2414 and an explicit @samp{n} with the second key specifier.
2415
2416 @example
2417 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
2418 @end example
2419
2420 @c This example is a bit contrived and needs more explanation.
2421 @c @item
2422 @c Sort records separated by an arbitrary string by using a pipe to convert
2423 @c each record delimiter string to @samp{\0}, then using sort's -z option,
2424 @c and converting each @samp{\0} back to the original record delimiter.
2425 @c
2426 @c @example
2427 @c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 's/\0/\n/g'
2428 @c @end example
2429
2430 @end itemize
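
The password file example can be tried without touching
@file{/etc/passwd}.  The sketch below uses a hypothetical sample file,
@file{passwd.sample}, with made-up entries, and sets @samp{LC_ALL=C} for
repeatability.

@example
tmp=$(mktemp -d) && cd "$tmp"
printf '%s\n' 'carol:x:30:30:Same Name' \
              'alice:x:7:7:Same Name' \
              'bob:x:100:100:Another' > passwd.sample

# Field 5 first, then ties broken numerically on field 3:
# bob, alice (7), carol (30).
LC_ALL=C sort -t : -k 5,5 -k 3,3n passwd.sample

# Without the n modifier the tie is broken as text, and "30"
# sorts before "7": bob, carol, alice.
LC_ALL=C sort -t : -k 5,5 -k 3,3 passwd.sample
@end example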
2431
2432
2433 @node uniq invocation
2434 @section @code{uniq}: Uniquify files
2435
2436 @pindex uniq
2437 @cindex uniquify files
2438
2439 @code{uniq} writes the unique lines in the given @var{input}, or
2440 standard input if nothing is given or for an @var{input} name of
2441 @samp{-}.  Synopsis:
2442
2443 @example
2444 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2445 @end example
2446
2447 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2448 discards all but one of identical successive lines.  Optionally, it can
2449 instead show only lines that appear exactly once, or lines that appear
2450 more than once.
2451
2452 The input must be sorted.  If your input is not sorted, perhaps you want
2453 to use @code{sort -u}.
2454
2455 If no @var{output} file is specified, @code{uniq} writes to standard
2456 output.
2457
2458 The program accepts the following options.  Also see @ref{Common options}.
2459
2460 @table @samp
2461
2462 @item -@var{n}
2463 @itemx -f @var{n}
2464 @itemx --skip-fields=@var{n}
2465 @opindex -@var{n}
2466 @opindex -f
2467 @opindex --skip-fields
Skip @var{n} fields on each line before checking for uniqueness.  Fields
are sequences of non-space non-tab characters that are separated from
each other by at least one space or tab.
2471
2472 @item +@var{n}
2473 @itemx -s @var{n}
2474 @itemx --skip-chars=@var{n}
2475 @opindex +@var{n}
2476 @opindex -s
2477 @opindex --skip-chars
2478 Skip @var{n} characters before checking for uniqueness.  If you use both
2479 the field and character skipping options, fields are skipped over first.
2480
2481 @item -c
2482 @itemx --count
2483 @opindex -c
2484 @opindex --count
2485 Print the number of times each line occurred along with the line.
2486
2487 @item -i
2488 @itemx --ignore-case
2489 @opindex -i
2490 @opindex --ignore-case
2491 Ignore differences in case when comparing lines.
2492
2493 @item -d
2494 @itemx --repeated
2495 @opindex -d
2496 @opindex --repeated
2497 @cindex duplicate lines, outputting
2498 Print only duplicate lines.
2499
2500 @item -D
2501 @itemx --all-repeated
2502 @opindex -D
2503 @opindex --all-repeated
2504 @cindex all duplicate lines, outputting
2505 Print all duplicate lines and only duplicate lines.
This option is useful mainly in conjunction with other options, e.g.,
to ignore case or to compare only selected fields.
2508 This is a GNU extension.
2509 @c FIXME: give an example showing *how* it's useful
2510
2511 @item -u
2512 @itemx --unique
2513 @opindex -u
2514 @opindex --unique
2515 @cindex unique lines, outputting
2516 Print only unique lines.
2517
2518 @item -w @var{n}
2519 @itemx --check-chars=@var{n}
2520 @opindex -w
2521 @opindex --check-chars
2522 Compare @var{n} characters on each line (after skipping any specified
2523 fields and characters).  By default the entire rest of the lines are
2524 compared.
2525
2526 @end table
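
As an illustration of the options above, the following sketch runs
@code{uniq} on small samples in which identical lines are already
adjacent.  The sample text and the file name @file{fruits} are made up
for this example, and the comments describe the results one would
expect rather than output copied from a particular version.

@example
# Made-up sample input; identical lines are already adjacent.
printf 'apple\napple\nbanana\ncherry\ncherry\ncherry\n' > fruits

uniq -c fruits    # each distinct line, preceded by its count
uniq -d fruits    # one copy of each repeated line: apple, cherry
uniq -u fruits    # lines that are not repeated: banana

# -D combined with -i prints every member of a group of lines
# that differ only in case: Apple, apple
printf 'Apple\napple\nbanana\n' | uniq -D -i

# Skip the first field before comparing: prints "1 x" and "3 y"
printf '1 x\n2 x\n3 y\n' | uniq -f 1
@end example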
2527
2528
2529 @node comm invocation
2530 @section @code{comm}: Compare two sorted files line by line
2531
2532 @pindex comm
2533 @cindex line-by-line comparison
2534 @cindex comparing sorted files
2535
2536 @code{comm} writes to standard output lines that are common, and lines
2537 that are unique, to two input files; a file name of @samp{-} means
2538 standard input.  Synopsis:
2539
2540 @example
2541 comm [@var{option}]@dots{} @var{file1} @var{file2}
2542 @end example
2543
2544 @vindex LC_COLLATE
2545 Before @code{comm} can be used, the input files must be sorted using the
2546 collating sequence specified by the @env{LC_COLLATE} locale, with
2547 trailing newlines significant.  If an input file ends in a non-newline
2548 character, a newline is silently appended.  The @code{sort} command with
2549 no options always outputs a file that is suitable input to @code{comm}.
2550
2551 @cindex differing lines
2552 @cindex common lines
With no options, @code{comm} produces three-column output.  Column one
2554 contains lines unique to @var{file1}, column two contains lines unique
2555 to @var{file2}, and column three contains lines common to both files.
2556 Columns are separated by a single TAB character.
2557 @c FIXME: when there's an option to supply an alternative separator
2558 @c string, append `by default' to the above sentence.
2559
2560 @opindex -1
2561 @opindex -2
2562 @opindex -3
2563 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2564 the corresponding columns.  Also see @ref{Common options}.
2565
2566 Unlike some other comparison utilities, @code{comm} has an exit
2567 status that does not depend on the result of the comparison.
2568 Upon normal completion @code{comm} produces an exit code of zero.
2569 If there is an error it exits with nonzero status.
2570
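As a brief illustration of the column-suppressing options, the sketch
below compares two small files that are already sorted.  The file
names @file{list1} and @file{list2} and their contents are made up for
this example.

@example
# Two made-up, already sorted files.
printf 'apple\nbanana\ncherry\n' > list1
printf 'banana\ncherry\ndate\n' > list2

comm list1 list2       # three columns, separated by TABs
comm -12 list1 list2   # only lines common to both: banana, cherry
comm -23 list1 list2   # only lines unique to list1: apple
comm -13 list1 list2   # only lines unique to list2: date
@end example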
2571
2572 @node tsort invocation
2573 @section @code{tsort}: Topological sort
2574
2575 @pindex tsort
2576 @cindex topological sort
2577
2578 @code{tsort} performs a topological sort on the given @var{file}, or
2579 standard input if no input file is given or for a @var{file} of
2580 @samp{-}.  Synopsis:
2581
2582 @example
2583 tsort [@var{option}] [@var{file}]
2584 @end example
2585
2586 @code{tsort} reads its input as pairs of strings, separated by blanks,
2587 indicating a partial ordering.  The output is a total ordering that
2588 corresponds to the given partial ordering.
2589
2590 For example
2591
2592 @example
2593 tsort <<EOF
2594 a b c
2595 d
2596 e f
2597 b c d e
2598 EOF
2599 @end example
2600
2601 @noindent
2602 will produce the output
2603
2604 @example
2605 a
2606 b
2607 c
2608 d
2609 e
2610 f
2611 @end example
2612
@code{tsort} detects cycles in the input and writes the first cycle
encountered to standard error.
2615
2616 Note that for a given partial ordering, generally there is no unique
2617 total ordering.
2618
2619 The only options are @samp{--help} and @samp{--version}.  @xref{Common
2620 options}.
2621
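The remarks above about non-unique orderings and cycles can be tried
with the following sketch.  The item names are arbitrary, and the
exact wording of the diagnostic and the order chosen among equally
valid results may differ between versions, so no output is shown.

@example
# Only "a before b" and "a before c" are required, so both
# "a b c" and "a c b" are valid results.
printf 'a b\na c\n' | tsort

# A cycle: a precedes b and b precedes a.  A diagnostic is
# written to standard error.
printf 'a b\nb a\n' | tsort
@end example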
2622
2623 @node ptx invocation
2624 @section @code{ptx}: Produce permuted indexes
2625
2626 @pindex ptx
2627
@code{ptx} reads a text file and essentially produces a permuted index, with
each keyword in its context.  It is invoked in one of two ways:
2630
2631 @example
2632 ptx [@var{option} @dots{}] [@var{file} @dots{}]
2633 ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
2634 @end example
2635
The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
all GNU extensions and reverts to traditional mode, thus introducing some
limitations and changing several of the program's default option values.
When @samp{-G} is not specified, GNU extensions are always enabled.  GNU
extensions to @code{ptx} are documented wherever appropriate in this
document.  For the full list, see @ref{Compatibility in ptx}.
2642
Individual options are explained in the following sections.
2644
When GNU extensions are enabled, there may be zero, one or several
@var{file}s after the options.  If there is no @var{file}, the program
reads the standard input.  If there are one or several @var{file}s, they
name the input files, which are all read in turn, as if all the
input files were concatenated.  However, there is a full contextual
break between each file and, when automatic referencing is requested,
file names and line numbers refer to individual text input files.  In
all cases, the program writes the permuted index to the standard
output.
2654
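A minimal sketch of the GNU calling convention described above
follows.  The file names @file{one.txt} and @file{two.txt} and their
contents are made up for this example.  The exact layout of the index
depends on the options in effect, so no output is shown.

@example
# Made-up sample files.
printf 'The quick brown fox.\n' > one.txt
printf 'A lazy dog sleeps.\n' > two.txt

# Index both files, writing the permuted index to standard output.
ptx one.txt two.txt

# With no file operand, read standard input instead.
cat one.txt two.txt | ptx
@end example
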
When GNU extensions are @emph{not} enabled, that is, when the program
operates in traditional mode, there may be zero, one or two parameters
besides the options.  If there are no parameters, the program reads the
standard input and writes the permuted index to the standard output.
If there is only one parameter, it names the text @var{input} to be read
instead of the standard input.  If two parameters are given, they give
respectively the name of the @var{input} file to read and the name of
the @var{output} file to produce.  @emph{Be very careful} to note that,
in this case, the contents of the file named by the second parameter are
destroyed.  This behaviour is dictated only by System V @code{ptx}
compatibility, because GNU Standards discourage output parameters not
introduced by an option.
2667
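The traditional calling convention can be sketched as follows.  The
names @file{input.txt} and @file{index.roff} are hypothetical and
chosen only for illustration.  The second form is shown because it
makes the overwritten file explicit in the command line.

@example
# Traditional mode: the second operand is the OUTPUT file, and its
# previous contents are lost.
ptx -G input.txt index.roff

# A more defensive equivalent using shell redirection.
ptx -G input.txt > index.roff
@end example
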
2668 Note that for @emph{any} file named as the value of an option or as an
2669 input text file, a single dash @kbd{-} may be used, in which case
2670 standard input is assumed.  However, it would not make sense to use this
2671 convention more than once per program invocation.
2672
2673 @menu
2674 * General options in ptx::      Options which affect general program behaviour.
2675 * Charset selection in ptx::    Underlying character set considerations.
2676 * Input processing in ptx::     Input fields, contexts, and keyword selection.
2677 * Output formatting in ptx::    Types of output format, and sizing the fields.
* Compatibility in ptx::        The GNU extensions to @code{ptx}.
2679 @end menu
2680
2681
2682 @node General options in ptx
2683 @subsection General options
2684
2685 @table @samp
2686
2687 @item -C
2688 @itemx --copyright
Print a short note about the copyright and copying conditions, then
exit without further processing.
2691
2692 @item -G
2693 @itemx --traditional
As already explained, this option disables all GNU extensions to
@code{ptx} and switches to traditional mode.
2696
2697 @item --help
Print a short help message on standard output, then exit without further
processing.
2700
2701 @item --version
Print the program version on standard output, then exit without further
processing.
2704
2705 @end table
2706
2707
2708 @node Charset selection in ptx
2709 @subsection Charset selection
2710
As it is set up now, the program assumes that the input file is coded
using 8-bit ISO 8859-1 code, also known as the Latin-1 character set,
@emph{unless} it is compiled for MS-DOS, in which case it uses the
character set of the IBM-PC.  (GNU @code{ptx} is not known to work on
smaller MS-DOS machines anymore.)  Compared to 7-bit @sc{ascii}, the set of
characters which are letters is then different, and this alters the
behaviour of regular expression matching.  Thus, the default regular
expression for a keyword allows foreign or diacriticized letters.
Keyword sorting, however, is still crude; it obeys the underlying
character set ordering quite blindly.
2721
2722 @table @samp
2723
2724 @item -f
2725 @itemx --ignore-case
2726 Fold lower case letters to upper case for sorting.
2727
2728 @end table
2729
2730
2731 @node Input processing in ptx
2732 @subsection Word selection and input processing
2733
2734 @table @samp
2735
2736 @item -b @var{file}
@itemx --break-file=@var{file}
2738
This option provides an alternative (to @samp{-W}) method of describing
which characters make up words.  It introduces the name of a
file which contains a list of characters which can@emph{not} be part of
a word; this file is called the @dfn{Break file}.  Any character which
is not part of the Break file is a word constituent.  If both options
@samp{-b} and @samp{-W} are specified, then @samp{-W} has precedence and
@samp{-b} is ignored.
2746
2747 When GNU extensions are enabled, the only way to avoid newline as a
2748 break character is to write all the break characters in the file with no
2749 newline at all, not even at the end of the file.  When GNU extensions
2750 are disabled, spaces, tabs and newlines are always considered as break
2751 characters even if not included in the Break file.
2752
2753 @item -i @var{file}
2754 @itemx --ignore-file=@var{file}
2755
2756 The file associated with this option contains a list of words which will
2757 never be taken as keywords in concordance output.  It is called the
2758 @dfn{Ignore file}.  The file contains exactly one word in each line; the
2759 end of line separation of words is not subject to the value of the
2760 @samp{-S} option.
2761
2762 There is a default Ignore file used by @code{ptx} when this option is
2763 not specified, usually found in @file{/usr/local/lib/eign} if this has
2764 not been changed at installation time.  If you want to deactivate the
2765 default Ignore file, specify @code{/dev/null} instead.
2766
2767 @item -o @var{file}
2768 @itemx --only-file=@var{file}
2769
2770 The file associated with this option contains a list of words which will
be retained in concordance output; any word not mentioned in this file
2772 is ignored.  The file is called the @dfn{Only file}.  The file contains
2773 exactly one word in each line; the end of line separation of words is
2774 not subject to the value of the @samp{-S} option.
2775
There is no default for the Only file.  When both an Only file and an
Ignore file are specified, a word is considered a keyword
only if it is listed in the Only file and not in the Ignore file.
2779
2780 @item -r
2781 @itemx --references
2782
On each input line, the leading sequence of non-white-space characters will be
taken to be a reference that has the purpose of identifying this input
line in the resulting permuted index.  For more information about reference
production, see @ref{Output formatting in ptx}.
2787 Using this option changes the default value for option @samp{-S}.
2788
2789 Using this option, the program does not try very hard to remove
2790 references from contexts in output, but it succeeds in doing so
2791 @emph{when} the context ends exactly at the newline.  If option
2792 @samp{-r} is used with @samp{-S} default value, or when GNU extensions
2793 are disabled, this condition is always met and references are completely
2794 excluded from the output contexts.
2795
2796 @item -S @var{regexp}
2797 @itemx --sentence-regexp=@var{regexp}
2798
2799 This option selects which regular expression will describe the end of a
line or the end of a sentence.  In fact, there is no other distinction
between end of lines or end of sentences than the effect of this regular
expression, and input line boundaries have no special significance
outside this option.  By default, when GNU extensions are enabled and if
@samp{-r} option is not used, end of sentences are used.  In this
case, the precise @var{regexp} is imported from GNU Emacs:
2806
2807 @example
2808 [.?!][]\"')@}]*\\($\\|\t\\|  \\)[ \t\n]*
2809 @end example
2810
2811 Whenever GNU extensions are disabled or if @samp{-r} option is used, end
2812 of lines are used; in this case, the default @var{regexp} is just:
2813
2814 @example
2815 \n
2816 @end example
2817
2818 Using an empty @var{regexp} is equivalent to completely disabling end of
2819 line or end of sentence recognition.  In this case, the whole file is
2820 considered to be a single big line or sentence.  The user might want to
2821 disallow all truncation flag generation as well, through option @samp{-F
2822 ""}.  @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2823 Manual}.
2824
2825 When the keywords happen to be near the beginning of the input line or
2826 sentence, this often creates an unused area at the beginning of the
2827 output context line; when the keywords happen to be near the end of the
2828 input line or sentence, this often creates an unused area at the end of
2829 the output context line.  The program tries to fill those unused areas
2830 by wrapping around context in them; the tail of the input line or
2831 sentence is used to fill the unused area on the left of the output line;
2832 the head of the input line or sentence is used to fill the unused area
2833 on the right of the output line.
2834
2835 As a matter of convenience to the user, many usual backslashed escape
2836 sequences, as found in the C language, are recognized and converted to
2837 the corresponding characters by @code{ptx} itself.
2838
2839 @item -W @var{regexp}
2840 @itemx --word-regexp=@var{regexp}
2841
2842 This option selects which regular expression will describe each keyword.
2843 By default, if GNU extensions are enabled, a word is a sequence of
2844 letters; the @var{regexp} used is @samp{\w+}.  When GNU extensions are
2845 disabled, a word is by default anything which ends with a space, a tab
2846 or a newline; the @var{regexp} used is @samp{[^ \t\n]+}.
2847
2848 An empty @var{regexp} is equivalent to not using this option, letting the
default apply.  @xref{Regexps, , Syntax of Regular Expressions, emacs,
2850 The GNU Emacs Manual}.
2851
2852 As a matter of convenience to the user, many usual backslashed escape
2853 sequences, as found in the C language, are recognized and converted to
2854 the corresponding characters by @code{ptx} itself.
2855
2856 @end table
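
The word-selection options above can be tried with the following
sketch.  The file names @file{ignore.txt}, @file{only.txt} and
@file{text.txt}, and their contents, are made up for this example.
The comments describe the intended effect, and no output is shown
because the layout depends on the output options in effect.

@example
# Made-up word lists, one word per line.
printf 'the\na\n' > ignore.txt
printf 'fox\ndog\n' > only.txt

printf 'the quick fox and a lazy dog\n' > text.txt

ptx -i ignore.txt text.txt   # "the" and "a" are never keywords
ptx -o only.txt text.txt     # only "fox" and "dog" become keywords

# Restrict keywords to runs of lower case letters.
ptx -W '[a-z]+' text.txt

# Treat the first word of each line as a reference.
printf 'ref1 some text\nref2 more text\n' | ptx -r
@end example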
2857
2858
2859 @node Output formatting in ptx
2860 @subsection Output formatting
2861
2862 Output format is mainly controlled by @samp{-O} and @samp{-T} options,
2863 described in the table below.  When neither @samp{-O} nor @samp{-T} is
2864 selected, and if GNU extensions are enabled, the program choose an
2865 output format suited for a dumb terminal.  Each keyword occurrence is
2866 output to the center of one line, surrounded by its left and right
2867 contexts.  Each field is properly justified, so the concordance output
2868 could readily be observed.  As a special feature, if automatic
2869 references are selected by option @samp{-A} and are output before the
2870 left context, that is, if option @samp{-R} is @emph{not} selected, then
2871 a colon is added after the reference; this nicely interfaces with GNU
2872 Emacs @code{next-error} processing.  In this default output format, each
2873 white space character, like newline and tab, is merely changed to
2874 exactly one space, with no special attempt to compress consecutive
2875 spaces.  This might change in the future.  Except for those white space
2876 characters, every other character of the underlying set of 256
2877 characters is transmitted verbatim.
2878
2879 Output format is further controlled by the following options.
2880
2881 @table @samp
2882
2883 @item -g @var{number}
2884 @itemx --gap-size=@var{number}
2885
2886 Select the size of the minimum white gap between the fields on the output
2887 line.
2888
2889 @item -w @var{number}
2890 @itemx --width=@var{number}
2891
2892 Select the output maximum width of each final line.  If references are
2893 used, they are included or excluded from the output maximum width
2894 depending on the value of option @samp{-R}.  If this option is not
2895 selected, that is, when references are output before the left context,
2896 the output maximum width takes into account the maximum length of all
references.  If this option is selected, that is, when references are
2898 output after the right context, the output maximum width does not take
2899 into account the space taken by references, nor the gap that precedes
2900 them.
2901
2902 @item -A
2903 @itemx --auto-reference
2904
2905 Select automatic references.  Each input line will have an automatic
2906 reference made up of the file name and the line ordinal, with a single
2907 colon between them.  However, the file name will be empty when standard
2908 input is being read.  If both @samp{-A} and @samp{-r} are selected, then
2909 the input reference is still read and skipped, but the automatic
2910 reference is used at output time, overriding the input reference.
2911
2912 @item -R
2913 @itemx --right-side-refs
2914
In the default output format, when option @samp{-R} is not used, any
references produced by the effect of options @samp{-r} or @samp{-A} are
placed at the beginning of each output line, before the left context.  In
the default output format, when option @samp{-R} is specified, references
are instead placed at the far right of output lines, after the right
context.  For any other output format, option @samp{-R} is almost
ignored, except for the fact that the width of references is @emph{not}
taken into account in total output width given by @samp{-w} whenever
@samp{-R} is selected.
2924
2925 This option is automatically selected whenever GNU extensions are
2926 disabled.
2927
2928 @item -F @var{string}
@itemx --flag-truncation=@var{string}
2930
2931 This option will request that any truncation in the output be reported
2932 using the string @var{string}.  Most output fields theoretically extend
2933 towards the beginning or the end of the current line, or current
2934 sentence, as selected with option @samp{-S}.  But there is a maximum
2935 allowed output line width, changeable through option @samp{-w}, which is
2936 further divided into space for various output fields.  When a field has
to be truncated because it cannot extend to the beginning or the end of
the current line and still fit in its allotted space, a truncation occurs.  By default,
2939 the string used is a single slash, as in @samp{-F /}.
2940
2941 @var{string} may have more than one character, as in @samp{-F ...}.
Also, in the particular case when @var{string} is empty (@samp{-F ""}),
truncation flagging is disabled, and no truncation marks are appended.
2945
2946 As a matter of convenience to the user, many usual backslashed escape
2947 sequences, as found in the C language, are recognized and converted to
2948 the corresponding characters by @code{ptx} itself.
2949
2950 @item -M @var{string}
2951 @itemx --macro-name=@var{string}
2952
2953 Select another @var{string} to be used instead of @samp{xx}, while
2954 generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
2955
2956 @item -O
2957 @itemx --format=roff
2958
2959 Choose an output format suitable for @code{nroff} or @code{troff}
2960 processing.  Each output line will look like:
2961
2962 @example
2963 .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
2964 @end example
2965
2966 so it will be possible to write an @samp{.xx} roff macro to take care of
2967 the output typesetting.  This is the default output format when GNU
2968 extensions are disabled.  Option @samp{-M} might be used to change
2969 @samp{xx} to another macro name.
2970
2971 In this output format, each non-graphical character, like newline and
2972 tab, is merely changed to exactly one space, with no special attempt to
2973 compress consecutive spaces.  Each quote character: @kbd{"} is doubled
2974 so it will be correctly processed by @code{nroff} or @code{troff}.
2975
2976 @item -T
2977 @itemx --format=tex
2978
2979 Choose an output format suitable for @TeX{} processing.  Each output
2980 line will look like:
2981
2982 @example
2983 \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
2984 @end example
2985
2986 @noindent
2987 so it will be possible to write a @code{\xx} definition to take care of
2988 the output typesetting.  Note that when references are not being
2989 produced, that is, neither option @samp{-A} nor option @samp{-r} is
2990 selected, the last parameter of each @code{\xx} call is inhibited.
2991 Option @samp{-M} might be used to change @samp{xx} to another macro
2992 name.
2993
2994 In this output format, some special characters, like @kbd{$}, @kbd{%},
2995 @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
2996 backslash.  Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
2997 backslash, but also enclosed in a pair of dollar signs to force
2998 mathematical mode.  The backslash itself produces the sequence
2999 @code{\backslash@{@}}.  Circumflex and tilde diacritics produce the
3000 sequence @code{^\@{ @}} and @code{~\@{ @}} respectively.  Other
3001 diacriticized characters of the underlying character set produce an
3002 appropriate @TeX{} sequence as far as possible.  The other non-graphical
characters, like newline and tab, and all other characters which are
3004 not part of @sc{ascii}, are merely changed to exactly one space, with no
3005 special attempt to compress consecutive spaces.  Let me know how to
3006 improve this special character processing for @TeX{}.
3007
3008 @end table
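
The three output formats described above can be compared with the
following sketch.  The file name @file{sample.txt}, its contents and
the macro name @samp{entry} are made up for this example.

@example
printf 'The quick brown fox.\n' > sample.txt

# Default terminal format, with automatic references, at most
# 60 columns wide.
ptx -A -w 60 sample.txt

# roff format; each output line is a call to the .xx macro.
ptx -O sample.txt

# TeX format, using \entry in place of the default \xx.
ptx -T -M entry sample.txt
@end example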
3009
3010
3011 @node Compatibility in ptx
3012 @subsection The GNU extensions to @code{ptx}
3013
3014 This version of @code{ptx} contains a few features which do not exist in
3015 System V @code{ptx}.  These extra features are suppressed by using the
3016 @samp{-G} command line option, unless overridden by other command line
3017 options.  Some GNU extensions cannot be recovered by overriding, so the
3018 simple rule is to avoid @samp{-G} if you care about GNU extensions.
3019 Here are the differences between this program and System V @code{ptx}.
3020
3021 @itemize @bullet
3022
3023 @item
This program can read many input files at once, and it always writes the
resulting concordance on standard output.  On the other hand, System V
@code{ptx} reads only one file and produces the result on standard output
or, if a second @var{file} parameter is given on the command line, to that
@var{file}.
3029
3030 Having output parameters not introduced by options is a quite dangerous
3031 practice which GNU avoids as far as possible.  So, for using @code{ptx}
3032 portably between GNU and System V, you should pay attention to always
3033 use it with a single input file, and always expect the result on
3034 standard output.  You might also want to automatically configure in a
3035 @samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
3036 the configurator finds that the installed @code{ptx} accepts @samp{-G}.
3037
3038 @item
3039 The only options available in System V @code{ptx} are options @samp{-b},
3040 @samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
3041 @samp{-w}.  All other options are GNU extensions and are not repeated in
3042 this enumeration.  Moreover, some options have a slightly different
3043 meaning when GNU extensions are enabled, as explained below.
3044
3045 @item
3046 By default, concordance output is not formatted for @code{troff} or
3047 @code{nroff}.  It is rather formatted for a dumb terminal.  @code{troff}
3048 or @code{nroff} output may still be selected through option @samp{-O}.
3049
3050 @item
3051 Unless @samp{-R} option is used, the maximum reference width is
3052 subtracted from the total output line width.  With GNU extensions
3053 disabled, width of references is not taken into account in the output
3054 line width computations.
3055
3056 @item
3057 All 256 characters, even @kbd{NUL}s, are always read and processed from
3058 input file with no adverse effect, even if GNU extensions are disabled.
3059 However, System V @code{ptx} does not accept 8-bit characters, a few
3060 control characters are rejected, and the tilde @kbd{~} is condemned.
3061
3062 @item
3063 Input line length is only limited by available memory, even if GNU
3064 extensions are disabled.  However, System V @code{ptx} processes only
3065 the first 200 characters in each line.
3066
3067 @item
The break (non-word) characters default to every character except the
letters of the underlying character set, diacriticized or not.  When GNU
extensions are disabled, the break characters default to space, tab and
newline only.
3072
3073 @item
3074 The program makes better use of output line width.  If GNU extensions
3075 are disabled, the program rather tries to imitate System V @code{ptx},
3076 but still, there are some slight disposition glitches this program does
3077 not completely reproduce.
3078
3079 @item
3080 The user can specify both an Ignore file and an Only file.  This is not
3081 allowed with System V @code{ptx}.
3082
3083 @end itemize
3084
3085
3086 @node Operating on fields within a line
3087 @chapter Operating on fields within a line
3088
3089 @menu
3090 * cut invocation::              Print selected parts of lines.
3091 * paste invocation::            Merge lines of files.
3092 * join invocation::             Join lines on a common field.
3093 @end menu
3094
3095
3096 @node cut invocation
3097 @section @code{cut}: Print selected parts of lines
3098
3099 @pindex cut
3100 @code{cut} writes to standard output selected parts of each line of each
3101 input file, or standard input if no files are given or for a file name of
3102 @samp{-}.  Synopsis:
3103
3104 @example
3105 cut [@var{option}]@dots{} [@var{file}]@dots{}
3106 @end example
3107
3108 In the table which follows, the @var{byte-list}, @var{character-list},
3109 and @var{field-list} are one or more numbers or ranges (two numbers
3110 separated by a dash) separated by commas.  Bytes, characters, and
fields are numbered starting at 1.  Incomplete ranges may be
3112 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
3113 @samp{@var{n}} through end of line or last field.
3114
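As a small illustration of the list syntax, the following sketch
combines an incomplete range at each end with a single position.  The
sample string is made up for this example.

@example
# Positions 1 through 3, position 5, and position 7 through the
# end of the line.
printf 'abcdefgh\n' | cut -c -3,5,7-
# prints: abcegh
@end example
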
3115 The program accepts the following options.  Also see @ref{Common
3116 options}.
3117
3118 @table @samp
3119
3120 @item -b @var{byte-list}
3121 @itemx --bytes=@var{byte-list}
3122 @opindex -b
3123 @opindex --bytes
3124 Print only the bytes in positions listed in @var{byte-list}.  Tabs and
3125 backspaces are treated like any other character; they take up 1 byte.
3126
3127 @item -c @var{character-list}
3128 @itemx --characters=@var{character-list}
3129 @opindex -c
3130 @opindex --characters
3131 Print only characters in positions listed in @var{character-list}.
3132 The same as @samp{-b} for now, but internationalization will change
3133 that.  Tabs and backspaces are treated like any other character; they
3134 take up 1 character.
3135
3136 @item -f @var{field-list}
3137 @itemx --fields=@var{field-list}
3138 @opindex -f
3139 @opindex --fields
3140 Print only the fields listed in @var{field-list}.  Fields are
3141 separated by a TAB character by default.
3142
3143 @item -d @var{input_delim_byte}
3144 @itemx --delimiter=@var{input_delim_byte}
3145 @opindex -d
3146 @opindex --delimiter
3147 For @samp{-f}, fields are separated in the input by the first character
3148 in @var{input_delim_byte} (default is TAB).
3149
3150 @item -n
3151 @opindex -n
3152 Do not split multi-byte characters (no-op for now).
3153
3154 @item -s
3155 @itemx --only-delimited
3156 @opindex -s
3157 @opindex --only-delimited
3158 For @samp{-f}, do not print lines that do not contain the field separator
3159 character.
3160
@item --output-delimiter=@var{output_delim_string}
@opindex --output-delimiter
For @samp{-f}, output fields are separated by @var{output_delim_string}.
The default is to use the input delimiter.
3165
3166
3167 @end table
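
The field-oriented options above can be sketched as follows.  The
sample lines are made up for this example and only resemble an entry
from a password file.

@example
# Login name and shell from a colon-separated sample line.
printf 'root:x:0:0:root:/root:/bin/sh\n' | cut -d : -f 1,7
# prints: root:/bin/sh

# Same fields, joined by a space instead of the input delimiter.
printf 'root:x:0:0:root:/root:/bin/sh\n' |
  cut -d : -f 1,7 --output-delimiter=' '
# prints: root /bin/sh

# With -s, lines lacking the delimiter are dropped.
printf 'no delimiter here\na:b\n' | cut -d : -f 2 -s
# prints: b
@end example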
3168
3169
3170 @node paste invocation
3171 @section @code{paste}: Merge lines of files
3172
3173 @pindex paste
3174 @cindex merging files
3175
3176 @code{paste} writes to standard output lines consisting of sequentially
3177 corresponding lines of each given file, separated by a TAB character.
3178 Standard input is used for a file name of @samp{-} or if no input files
3179 are given.
3180
3181 Synopsis:
3182
3183 @example
3184 paste [@var{option}]@dots{} [@var{file}]@dots{}
3185 @end example
3186
3187 The program accepts the following options.  Also see @ref{Common options}.
3188
3189 @table @samp
3190
3191 @item -s
3192 @itemx --serial
3193 @opindex -s
3194 @opindex --serial
3195 Paste the lines of one file at a time rather than one line from each
3196 file.
3197
3198 @item -d @var{delim-list}
@itemx --delimiters=@var{delim-list}
3200 @opindex -d
3201 @opindex --delimiters
3202 Consecutively use the characters in @var{delim-list} instead of
3203 TAB to separate merged lines.  When @var{delim-list} is
3204 exhausted, start again at its beginning.
3205
3206 @end table
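
The following sketch illustrates the default behaviour and both
options.  The file names @file{nums} and @file{names} and their
contents are made up for this example; @samp{<TAB>} in the comments
stands for a TAB character.

@example
printf '1\n2\n3\n' > nums
printf 'one\ntwo\nthree\n' > names

paste nums names         # 1<TAB>one, 2<TAB>two, 3<TAB>three
paste -d : nums names    # 1:one, 2:two, 3:three
paste -s nums names      # 1<TAB>2<TAB>3, then one<TAB>two<TAB>three

# The delimiter list is reused cyclically.
printf 'a\nb\nc\nd\n' | paste -s -d ',:' -
# prints: a,b:c,d
@end example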
3207
3208
3209 @node join invocation
3210 @section @code{join}: Join lines on a common field
3211
3212 @pindex join
3213 @cindex common field, joining on
3214
3215 @code{join} writes to standard output a line for each pair of input
3216 lines that have identical join fields.  Synopsis:
3217
3218 @example
3219 join [@var{option}]@dots{} @var{file1} @var{file2}
3220 @end example
3221
3222 @vindex LC_COLLATE
3223 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
3224 meaning standard input.  @var{file1} and @var{file2} should be already
3225 sorted in increasing textual order on the join fields, using the
3226 collating sequence specified by the @env{LC_COLLATE} locale.  Unless
3227 the @samp{-t} option is given, the input should be sorted ignoring blanks at
3228 the start of the join field, as in @code{sort -b}.  If the
3229 @samp{--ignore-case} option is given, lines should be sorted without
3230 regard to the case of characters in the join field, as in @code{sort -f}.
3231
3232 The defaults are: the join field is the first field in each line;
3233 fields in the input are separated by one or more blanks, with leading
3234 blanks on the line ignored; fields in the output are separated by a
3235 space; each output line consists of the join field, the remaining
3236 fields from @var{file1}, then the remaining fields from @var{file2}.
3237
3238 The program accepts the following options.  Also see @ref{Common options}.
3239
3240 @table @samp
3241
3242 @item -a @var{file-number}
3243 @opindex -a
3244 Print a line for each unpairable line in file @var{file-number} (either
3245 @samp{1} or @samp{2}), in addition to the normal output.
3246
3247 @item -e @var{string}
3248 @opindex -e
3249 Replace those output fields that are missing in the input with
3250 @var{string}.
3251
3252 @item -i
3253 @itemx --ignore-case
3254 @opindex -i
3255 @opindex --ignore-case
3256 Ignore differences in case when comparing keys.
3257 With this option, the lines of the input files must be ordered in the same way.
3258 Use @samp{sort -f} to produce this ordering.
3259
3260 @item -1 @var{field}
3261 @itemx -j1 @var{field}
3262 @opindex -1
3263 @opindex -j1
3264 Join on field @var{field} (a positive integer) of file 1.
3265
3266 @item -2 @var{field}
3267 @itemx -j2 @var{field}
3268 @opindex -2
3269 @opindex -j2
3270 Join on field @var{field} (a positive integer) of file 2.
3271
3272 @item -j @var{field}
3273 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
3274
3275 @item -o @var{field-list}@dots{}
3276 Construct each output line according to the format in @var{field-list}.
3277 Each element in @var{field-list} is either the single character @samp{0} or
3278 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
3279 @samp{2} and @var{n} is a positive field number.
3280
3281 A field specification of @samp{0} denotes the join field.
3282 In most cases, the functionality of the @samp{0} field spec
3283 may be reproduced using the explicit @var{m.n} that corresponds
3284 to the join field.  However, when printing unpairable lines
3285 (using either of the @samp{-a} or @samp{-v} options), there is no way
3286 to specify the join field using @var{m.n} in @var{field-list}
3287 if there are unpairable lines in both files.
3288 To give @code{join} that functionality, @sc{posix} invented the @samp{0}
3289 field specification notation.
3290
3291 The elements in @var{field-list}
3292 are separated by commas or blanks.  Multiple @var{field-list}
3293 arguments can be given after a single @samp{-o} option; the values
3294 of all lists given with @samp{-o} are concatenated together.
All output lines, including those printed because of any @samp{-a} or
@samp{-v} option, are subject to the specified @var{field-list}.
3297
3298 @item -t @var{char}
3299 Use character @var{char} as the input and output field separator.
3300
3301 @item -v @var{file-number}
3302 Print a line for each unpairable line in file @var{file-number}
3303 (either @samp{1} or @samp{2}), instead of the normal output.
3304
3305 @end table
3306
3307 In addition, when GNU @code{join} is invoked with exactly one argument,
3308 options @samp{--help} and @samp{--version} are recognized.  @xref{Common
3309 options}.
3310
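The options above are easier to follow with a concrete case.  The
sketch below is only an illustration: the files @file{left} and
@file{right} and their contents are made up, and both are already
sorted on the first field, as @code{join} expects.

@example
dir=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$dir/left"
printf 'a x\nc y\nd z\n' > "$dir/right"

# Default: one line per matching pair; prints "a 1 x" and "c 3 y".
join "$dir/left" "$dir/right"

# Only the unpairable lines of file 1; prints "b 2".
join -v 1 "$dir/left" "$dir/right"

# Pairs plus unpairable lines from both files, with missing fields
# filled in; prints "a 1 x", "b 2 NONE", "c 3 y", "d NONE z".
join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$dir/left" "$dir/right"

rm -r "$dir"
@end example

@noindent
The last command uses the @samp{0} field specification so that the join
field is printed for unpairable lines from either file.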
3311
3312 @node Operating on characters
3313 @chapter Operating on characters
3314
3315 @cindex operating on characters
3316
These commands operate on individual characters.
3318
3319 @menu
3320 * tr invocation::               Translate, squeeze, and/or delete characters.
3321 * expand invocation::           Convert tabs to spaces.
3322 * unexpand invocation::         Convert spaces to tabs.
3323 @end menu
3324
3325
3326 @node tr invocation
3327 @section @code{tr}: Translate, squeeze, and/or delete characters
3328
3329 @pindex tr
3330
3331 Synopsis:
3332
3333 @example
3334 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
3335 @end example
3336
3337 @code{tr} copies standard input to standard output, performing
3338 one of the following operations:
3339
3340 @itemize @bullet
3341 @item
3342 translate, and optionally squeeze repeated characters in the result,
3343 @item
3344 squeeze repeated characters,
3345 @item
3346 delete characters,
3347 @item
3348 delete characters, then squeeze repeated characters from the result.
3349 @end itemize
3350
3351 The @var{set1} and (if given) @var{set2} arguments define ordered
3352 sets of characters, referred to below as @var{set1} and @var{set2}.  These
3353 sets are the characters of the input that @code{tr} operates on.
3354 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
3355 complement (all of the characters that are not in @var{set1}).
3356
3357 @menu
3358 * Character sets::              Specifying sets of characters.
* Translating::                 Changing one character to another.
3360 * Squeezing::                   Squeezing repeats and deleting.
3361 * Warnings in tr::              Warning messages.
3362 @end menu
3363
3364
3365 @node Character sets
3366 @subsection Specifying sets of characters
3367
3368 @cindex specifying sets of characters
3369
3370 The format of the @var{set1} and @var{set2} arguments resembles
3371 the format of regular expressions; however, they are not regular
3372 expressions, only lists of characters.  Most characters simply
3373 represent themselves in these strings, but the strings can contain
3374 the shorthands listed below, for convenience.  Some of them can be
3375 used only in @var{set1} or @var{set2}, as noted below.
3376
3377 @table @asis
3378
3379 @item Backslash escapes
3380 @cindex backslash escapes
3381
3382 A backslash followed by a character not listed below causes an error
3383 message.
3384
3385 @table @samp
3386 @item \a
3387 Control-G.
3388 @item \b
3389 Control-H.
3390 @item \f
3391 Control-L.
3392 @item \n
3393 Control-J.
3394 @item \r
3395 Control-M.
3396 @item \t
3397 Control-I.
3398 @item \v
3399 Control-K.
3400 @item \@var{ooo}
The character with the value given by @var{ooo}, which is 1 to 3
octal digits.
3403 @item \\
3404 A backslash.
3405 @end table
3406
3407 @item Ranges
3408 @cindex ranges
3409
3410 The notation @samp{@var{m}-@var{n}} expands to all of the characters
3411 from @var{m} through @var{n}, in ascending order.  @var{m} should
3412 collate before @var{n}; if it doesn't, an error results.  As an example,
3413 @samp{0-9} is the same as @samp{0123456789}.  Although GNU @code{tr}
3414 does not support the System V syntax that uses square brackets to
3415 enclose ranges, translations specified in that format will still work as
long as the brackets in @var{set1} correspond to identical brackets
in @var{set2}.
3418
3419 @item Repeated characters
3420 @cindex repeated characters
3421
3422 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
3423 copies of character @var{c}.  Thus, @samp{[y*6]} is the same as
@samp{yyyyyy}.  The notation @samp{[@var{c}*]} in @var{set2} expands
3425 to as many copies of @var{c} as are needed to make @var{set2} as long as
3426 @var{set1}.  If @var{n} begins with @samp{0}, it is interpreted in
3427 octal, otherwise in decimal.
3428
3429 @item Character classes
@cindex character classes
3431
3432 The notation @samp{[:@var{class}:]} expands to all of the characters in
3433 the (predefined) class @var{class}.  The characters expand in no
3434 particular order, except for the @code{upper} and @code{lower} classes,
3435 which expand in ascending order.  When the @samp{--delete} (@samp{-d})
3436 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
3437 character class can be used in @var{set2}.  Otherwise, only the
3438 character classes @code{lower} and @code{upper} are accepted in
3439 @var{set2}, and then only if the corresponding character class
3440 (@code{upper} and @code{lower}, respectively) is specified in the same
3441 relative position in @var{set1}.  Doing this specifies case conversion.
3442 The class names are given below; an error results when an invalid class
3443 name is given.
3444
3445 @table @code
3446 @item alnum
3447 @opindex alnum
3448 Letters and digits.
3449 @item alpha
3450 @opindex alpha
3451 Letters.
3452 @item blank
3453 @opindex blank
3454 Horizontal whitespace.
3455 @item cntrl
3456 @opindex cntrl
3457 Control characters.
3458 @item digit
3459 @opindex digit
3460 Digits.
3461 @item graph
3462 @opindex graph
3463 Printable characters, not including space.
3464 @item lower
3465 @opindex lower
3466 Lowercase letters.
3467 @item print
3468 @opindex print
3469 Printable characters, including space.
3470 @item punct
3471 @opindex punct
3472 Punctuation characters.
3473 @item space
3474 @opindex space
3475 Horizontal or vertical whitespace.
3476 @item upper
3477 @opindex upper
3478 Uppercase letters.
3479 @item xdigit
3480 @opindex xdigit
3481 Hexadecimal digits.
3482 @end table
3483
3484 @item Equivalence classes
3485 @cindex equivalence classes
3486
3487 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
3488 equivalent to @var{c}, in no particular order.  Equivalence classes are
3489 a relatively recent invention intended to support non-English alphabets.
3490 But there seems to be no standard way to define them or determine their
3491 contents.  Therefore, they are not fully implemented in GNU @code{tr};
3492 each character's equivalence class consists only of that character,
3493 which is of no particular use.
3494
3495 @end table
3496
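The shorthands above can be tried out on short strings.  The inputs
below are made up for illustration, and the comments give the output
expected from GNU @code{tr}.

@example
# A range in each set; prints ABC 123.
printf 'abc 123\n' | tr 'a-c' 'A-C'

# [c*] in SET2 repeats c to the length of SET1; prints abc ###.
printf 'abc 123\n' | tr '0-9' '[#*]'

# A backslash escape: each TAB becomes a space; prints a b.
printf 'a\tb\n' | tr '\t' ' '

# A character class in SET1; prints Hello World.
printf 'Hello, World\n' | tr -d '[:punct:]'
@end example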
3497
3498 @node Translating
3499 @subsection Translating
3500
3501 @cindex translating characters
3502
3503 @code{tr} performs translation when @var{set1} and @var{set2} are
3504 both given and the @samp{--delete} (@samp{-d}) option is not given.
3505 @code{tr} translates each character of its input that is in @var{set1}
3506 to the corresponding character in @var{set2}.  Characters not in
3507 @var{set1} are passed through unchanged.  When a character appears more
3508 than once in @var{set1} and the corresponding characters in @var{set2}
3509 are not all the same, only the final one is used.  For example, these
3510 two commands are equivalent:
3511
3512 @example
3513 tr aaa xyz
3514 tr a z
3515 @end example
3516
3517 A common use of @code{tr} is to convert lowercase characters to
3518 uppercase.  This can be done in many ways.  Here are three of them:
3519
3520 @example
3521 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
3522 tr a-z A-Z
3523 tr '[:lower:]' '[:upper:]'
3524 @end example
3525
3526 When @code{tr} is performing translation, @var{set1} and @var{set2}
3527 typically have the same length.  If @var{set1} is shorter than
3528 @var{set2}, the extra characters at the end of @var{set2} are ignored.
3529
3530 On the other hand, making @var{set1} longer than @var{set2} is not
3531 portable; @sc{posix.2} says that the result is undefined.  In this situation,
3532 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
3533 the last character of @var{set2} as many times as necessary.  System V
3534 @code{tr} truncates @var{set1} to the length of @var{set2}.
3535
3536 By default, GNU @code{tr} handles this case like BSD @code{tr}.  When
3537 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
3538 handles this case like the System V @code{tr} instead.  This option is
3539 ignored for operations other than translation.
3540
3541 Acting like System V @code{tr} in this case breaks the relatively common
3542 BSD idiom:
3543
3544 @example
3545 tr -cs A-Za-z0-9 '\012'
3546 @end example
3547
3548 @noindent
3549 because it converts only zero bytes (the first element in the
3550 complement of @var{set1}), rather than all non-alphanumerics, to
3551 newlines.
3552
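The handling of a @var{set1} that is longer than @var{set2} can be
observed directly.  The following is a small sketch with made-up input,
assuming GNU @code{tr}; other implementations may print something else
for the first command, since @sc{posix.2} leaves the result undefined.

@example
# Default (BSD-like): SET2 is padded with its last character;
# prints xyyy.
printf 'abcd\n' | tr abcd xy

# With -t, SET1 is truncated to the length of SET2; prints xycd.
printf 'abcd\n' | tr -t abcd xy

# A repeated character in SET1 uses its final mapping; prints bznznz.
printf 'banana\n' | tr aaa xyz
@end example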
3553
3554 @node Squeezing
3555 @subsection Squeezing repeats and deleting
3556
3557 @cindex squeezing repeat characters
3558 @cindex deleting characters
3559
3560 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
3561 removes any input characters that are in @var{set1}.
3562
3563 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
3564 @code{tr} replaces each input sequence of a repeated character that
3565 is in @var{set1} with a single occurrence of that character.
3566
3567 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
3568 first performs any deletions using @var{set1}, then squeezes repeats
3569 from any remaining characters using @var{set2}.
3570
3571 The @samp{--squeeze-repeats} option may also be used when translating,
3572 in which case @code{tr} first performs translation, then squeezes
3573 repeats from any remaining characters using @var{set2}.
3574
3575 Here are some examples to illustrate various combinations of options:
3576
3577 @itemize @bullet
3578
3579 @item
3580 Remove all zero bytes:
3581
3582 @example
3583 tr -d '\000'
3584 @end example
3585
3586 @item
3587 Put all words on lines by themselves.  This converts all
3588 non-alphanumeric characters to newlines, then squeezes each string
3589 of repeated newlines into a single newline:
3590
3591 @example
3592 tr -cs 'a-zA-Z0-9' '[\n*]'
3593 @end example
3594
3595 @item
3596 Convert each sequence of repeated newlines to a single newline:
3597
3598 @example
3599 tr -s '\n'
3600 @end example
3601
3602 @item
3603 Find doubled occurrences of words in a document.
3604 For example, people often write ``the the'' with the duplicated words
separated by a newline.  The Bourne shell script below works first
3606 by converting each sequence of punctuation and blank characters to a
3607 single newline.  That puts each ``word'' on a line by itself.
3608 Next it maps all uppercase characters to lower case, and finally it
3609 runs @code{uniq} with the @samp{-d} option to print out only the words
3610 that were adjacent duplicates.
3611
3612 @example
3613 #!/bin/sh
3614 cat "$@@" \
3615   | tr -s '[:punct:][:blank:]' '\n' \
3616   | tr '[:upper:]' '[:lower:]' \
3617   | uniq -d
3618 @end example
3619
3620 @end itemize
3621
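A few small runs may help to make the order of operations concrete.
The sample strings below are made up for illustration; the comments
give the output that GNU @code{tr} is expected to produce.

@example
# Squeeze only: runs of a or b collapse; prints abccdd.
printf 'aabbccdd\n' | tr -s 'ab'

# Delete with SET1, then squeeze with SET2; prints cdd.
printf 'aabbccdd\n' | tr -ds 'ab' 'c'

# Translate, then squeeze the characters of SET2; prints hello_world.
printf 'hello    world\n' | tr -s ' ' '_'

# The doubled-word pipeline from the last item above, fed two lines
# that split a repeated word across a newline; prints the.
printf 'Paris in the\nthe spring\n' \
  | tr -s '[:punct:][:blank:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d
@end example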
3622
3623 @node Warnings in tr
3624 @subsection Warning messages
3625
3626 @vindex POSIXLY_CORRECT
3627 Setting the environment variable @env{POSIXLY_CORRECT} turns off the
3628 following warning and error messages, for strict compliance with
3629 @sc{posix.2}.  Otherwise, the following diagnostics are issued:
3630
3631 @enumerate
3632
3633 @item
3634 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
3635 is not, and @var{set2} is given, GNU @code{tr} by default prints
3636 a usage message and exits, because @var{set2} would not be used.
3637 The @sc{posix} specification says that @var{set2} must be ignored in
3638 this case. Silently ignoring arguments is a bad idea.
3639
3640 @item
3641 When an ambiguous octal escape is given.  For example, @samp{\400}
3642 is actually @samp{\40} followed by the digit @samp{0}, because the
3643 value 400 octal does not fit into a single byte.
3644
3645 @end enumerate
3646
3647 GNU @code{tr} does not provide complete BSD or System V compatibility.
3648 For example, it is impossible to disable interpretation of the @sc{posix}
3649 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}.  Also, GNU
3650 @code{tr} does not delete zero bytes automatically, unlike traditional
3651 Unix versions, which provide no way to preserve zero bytes.
3652
3653
3654 @node expand invocation
3655 @section @code{expand}: Convert tabs to spaces
3656
3657 @pindex expand
3658 @cindex tabs to spaces, converting
3659 @cindex converting tabs to spaces
3660
3661 @code{expand} writes the contents of each given @var{file}, or standard
3662 input if none are given or for a @var{file} of @samp{-}, to standard
3663 output, with tab characters converted to the appropriate number of
3664 spaces.  Synopsis:
3665
3666 @example
3667 expand [@var{option}]@dots{} [@var{file}]@dots{}
3668 @end example
3669
3670 By default, @code{expand} converts all tabs to spaces.  It preserves
3671 backspace characters in the output; they decrement the column count for
3672 tab calculations.  The default action is equivalent to @samp{-8} (set
3673 tabs every 8 columns).
3674
3675 The program accepts the following options.  Also see @ref{Common options}.
3676
3677 @table @samp
3678
3679 @item -@var{tab1}[,@var{tab2}]@dots{}
3680 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3681 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3682 @opindex -@var{tab}
3683 @opindex -t
3684 @opindex --tabs
3685 @cindex tabstops, setting
3686 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3687 (default is 8).  Otherwise, set the tabs at columns @var{tab1},
3688 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
3689 last tabstop given with single spaces.  If the tabstops are specified
3690 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3691 blanks as well as by commas.
3692
3693 @item -i
3694 @itemx --initial
3695 @opindex -i
3696 @opindex --initial
3697 @cindex initial tabs, converting
Only convert initial tabs (those that precede all characters that are
neither spaces nor tabs) on each line to spaces.
3700
3701 @end table
3702
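Because spaces and tabs look alike on a terminal, the sketch below
pipes the result through @code{tr} so that each space shows up as
@samp{.} and each remaining tab as @samp{>}.  That final @code{tr}
stage is only a display aid added for this example, and the input
strings are made up.

@example
# Default tab stops every 8 columns; prints a.......b.
printf 'a\tb\n' | expand | tr ' \t' '.>'

# Tab stops every 4 columns; prints a...b.
printf 'a\tb\n' | expand -t 4 | tr ' \t' '.>'

# Explicit stops at columns 4 and 10; the tab after the last stop
# becomes a single space; prints a...b.....c.d.
printf 'a\tb\tc\td\n' | expand -t 4,10 | tr ' \t' '.>'

# Only the initial tab is converted; prints ....a>b.
printf '\ta\tb\n' | expand -i -t 4 | tr ' \t' '.>'
@end example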
3703
3704 @node unexpand invocation
3705 @section @code{unexpand}: Convert spaces to tabs
3706
3707 @pindex unexpand
3708
3709 @code{unexpand} writes the contents of each given @var{file}, or
3710 standard input if none are given or for a @var{file} of @samp{-}, to
3711 standard output, with strings of two or more space or tab characters
3712 converted to as many tabs as possible followed by as many spaces as are
3713 needed.  Synopsis:
3714
3715 @example
3716 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
3717 @end example
3718
By default, @code{unexpand} converts only initial spaces and tabs (those
that precede all characters that are neither spaces nor tabs) on each line.  It
3721 preserves backspace characters in the output; they decrement the column
3722 count for tab calculations.  By default, tabs are set at every 8th
3723 column.
3724
3725 The program accepts the following options.  Also see @ref{Common options}.
3726
3727 @table @samp
3728
3729 @item -@var{tab1}[,@var{tab2}]@dots{}
3730 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3731 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3732 @opindex -@var{tab}
3733 @opindex -t
3734 @opindex --tabs
3735 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3736 instead of the default 8.  Otherwise, set the tabs at columns
3737 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
3738 tabs beyond the tabstops given unchanged.  If the tabstops are specified
3739 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3740 blanks as well as by commas.  This option implies the @samp{-a} option.
3741
3742 @item -a
3743 @itemx --all
3744 @opindex -a
3745 @opindex --all
3746 Convert all strings of two or more spaces or tabs, not just initial
3747 ones, to tabs.
3748
3749 @end table
3750
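The difference between the default and @samp{-a} can be seen in the
following sketch.  As a display aid added for this example, the output
is piped through @code{tr} so that each space shows up as @samp{.} and
each tab as @samp{>}.  The input line is made up: eight spaces,
@samp{a}, seven spaces, @samp{b}.

@example
# Default: only the leading blanks are converted; prints >a.......b.
printf '        a       b\n' | unexpand | tr ' \t' '.>'

# With -a, the inner run that ends at a tab stop is converted too;
# prints >a>b.
printf '        a       b\n' | unexpand -a | tr ' \t' '.>'
@end example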
3751 @c              What's GNU?
3752 @c              Arnold Robbins
3753 @node Opening the software toolbox
3754 @chapter Opening the software toolbox
3755
3756 This chapter originally appeared in @cite{Linux Journal}, volume 1,
3757 number 2, in the @cite{What's GNU?} column. It was written by Arnold
3758 Robbins.
3759
3760 @menu
3761 * Toolbox introduction::        Toolbox introduction
3762 * I/O redirection::             I/O redirection
3763 * The who command::             The @code{who} command
3764 * The cut command::             The @code{cut} command
3765 * The sort command::            The @code{sort} command
3766 * The uniq command::            The @code{uniq} command
3767 * Putting the tools together::  Putting the tools together
3768 @end menu
3769
3770
3771 @node Toolbox introduction
3772 @unnumberedsec Toolbox introduction
3773
3774 This month's column is only peripherally related to the GNU Project, in
3775 that it describes a number of the GNU tools on your Linux system and how they
3776 might be used.  What it's really about is the ``Software Tools'' philosophy
3777 of program development and usage.
3778
3779 The software tools philosophy was an important and integral concept
3780 in the initial design and development of Unix (of which Linux and GNU are
3781 essentially clones).  Unfortunately, in the modern day press of
3782 Internetworking and flashy GUIs, it seems to have fallen by the
3783 wayside.  This is a shame, since it provides a powerful mental model
3784 for solving many kinds of problems.
3785
3786 Many people carry a Swiss Army knife around in their pants pockets (or
3787 purse).  A Swiss Army knife is a handy tool to have: it has several knife
3788 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
3789 a number of other things on it.  For the everyday, small miscellaneous jobs
3790 where you need a simple, general purpose tool, it's just the thing.
3791
3792 On the other hand, an experienced carpenter doesn't build a house using
3793 a Swiss Army knife.  Instead, he has a toolbox chock full of specialized
3794 tools---a saw, a hammer, a screwdriver, a plane, and so on.  And he knows
3795 exactly when and where to use each tool; you won't catch him hammering nails
3796 with the handle of his screwdriver.
3797
3798 The Unix developers at Bell Labs were all professional programmers and trained
3799 computer scientists.  They had found that while a one-size-fits-all program
3800 might appeal to a user because there's only one program to use, in practice
3801 such programs are
3802
3803 @enumerate a
3804 @item
3805 difficult to write,
3806
3807 @item
3808 difficult to maintain and
3809 debug, and
3810
3811 @item
3812 difficult to extend to meet new situations.
3813 @end enumerate
3814
3815 Instead, they felt that programs should be specialized tools.  In short, each
3816 program ``should do one thing well.''  No more and no less.  Such programs are
3817 simpler to design, write, and get right---they only do one thing.
3818
Furthermore, they found that, with the right machinery for hooking programs
together, the whole was greater than the sum of the parts.  By combining
3821 several special purpose programs, you could accomplish a specific task
3822 that none of the programs was designed for, and accomplish it much more
3823 quickly and easily than if you had to write a special purpose program.
3824 We will see some (classic) examples of this further on in the column.
(An important additional point was that, if necessary, you should take a
detour and first build any software tools you need, if you don't already
have something appropriate in the toolbox.)
3828
3829 @node I/O redirection
3830 @unnumberedsec I/O redirection
3831
3832 Hopefully, you are familiar with the basics of I/O redirection in the
3833 shell, in particular the concepts of ``standard input,'' ``standard output,''
3834 and ``standard error''.  Briefly, ``standard input'' is a data source, where
3835 data comes from.  A program should not need to either know or care if the
3836 data source is a disk file, a keyboard, a magnetic tape, or even a punched
3837 card reader.  Similarly, ``standard output'' is a data sink, where data goes
3838 to.  The program should neither know nor care where this might be.
3839 Programs that only read their standard input, do something to the data,
3840 and then send it on, are called ``filters'', by analogy to filters in a
3841 water pipeline.
3842
3843 With the Unix shell, it's very easy to set up data pipelines:
3844
3845 @example
3846 program_to_create_data | filter1 | .... | filterN > final.pretty.data
3847 @end example
3848
3849 We start out by creating the raw data; each filter applies some successive
3850 transformation to the data, until by the time it comes out of the pipeline,
3851 it is in the desired form.
3852
3853 This is fine and good for standard input and standard output.  Where does the
standard error come into play?  Well, think about @code{filter1} in
3855 the pipeline above.  What happens if it encounters an error in the data it
3856 sees?  If it writes an error message to standard output, it will just
3857 disappear down the pipeline into @code{filter2}'s input, and the
3858 user will probably never see it.  So programs need a place where they can send
3859 error messages so that the user will notice them.  This is standard error,
3860 and it is usually connected to your console or window, even if you have
3861 redirected standard output of your program away from your screen.
3862
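The separation of the two output streams can be sketched with a
stand-in for @code{filter1}.  The parenthesized commands below are
hypothetical and simply play the part of a filter that emits one line
of data and one complaint.

@example
# The warning goes to standard error, so it bypasses the pipe and
# is not upper-cased; only "data" travels on to tr, which prints DATA.
( echo data; echo 'warning: odd input' >&2 ) | tr a-z A-Z

# Standard error can be redirected separately; here the warning is
# discarded and only DATA is printed.
( echo data; echo 'warning: odd input' >&2 ) 2>/dev/null | tr a-z A-Z
@end example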
3863 For filter programs to work together, the format of the data has to be
3864 agreed upon.  The most straightforward and easiest format to use is simply
3865 lines of text.  Unix data files are generally just streams of bytes, with
3866 lines delimited by the @sc{ascii} @sc{lf} (Line Feed) character,
3867 conventionally called a ``newline'' in the Unix literature. (This is
3868 @code{'\n'} if you're a C programmer.)  This is the format used by all
3869 the traditional filtering programs.  (Many earlier operating systems
3870 had elaborate facilities and special purpose programs for managing
3871 binary data.  Unix has always shied away from such things, under the
3872 philosophy that it's easiest to simply be able to view and edit your
3873 data with a text editor.)
3874
3875 OK, enough introduction. Let's take a look at some of the tools, and then
3876 we'll see how to hook them together in interesting ways.   In the following
3877 discussion, we will only present those command line options that interest
3878 us.  As you should always do, double check your system documentation
3879 for the full story.
3880
3881 @node The who command
3882 @unnumberedsec The @code{who} command
3883
3884 The first program is the @code{who} command.  By itself, it generates a
3885 list of the users who are currently logged in.  Although I'm writing
3886 this on a single-user system, we'll pretend that several people are
3887 logged in:
3888
3889 @example
3890 $ who
3891 arnold   console Jan 22 19:57
3892 miriam   ttyp0   Jan 23 14:19(:0.0)
3893 bill     ttyp1   Jan 21 09:32(:0.0)
3894 arnold   ttyp2   Jan 23 20:48(:0.0)
3895 @end example
3896
3897 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
3898 There are three people logged in, and I am logged in twice.  On traditional
3899 Unix systems, user names are never more than eight characters long.  This
3900 little bit of trivia will be useful later.  The output of @code{who} is nice,
3901 but the data is not all that exciting.
3902
3903 @node The cut command
3904 @unnumberedsec The @code{cut} command
3905
3906 The next program we'll look at is the @code{cut} command.  This program
3907 cuts out columns or fields of input data.  For example, we can tell it
to print just the login name and full name from the @file{/etc/passwd}
file.  The @file{/etc/passwd} file has seven fields, separated by
3910 colons:
3911
3912 @example
3913 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
3914 @end example
3915
To get the first and fifth fields, we would use @code{cut} like this:
3917
3918 @example
3919 $ cut -d: -f1,5 /etc/passwd
3920 root:Operator
3921 @dots{}
3922 arnold:Arnold D. Robbins
3923 miriam:Miriam A. Robbins
3924 @dots{}
3925 @end example
3926
3927 With the @samp{-c} option, @code{cut} will cut out specific characters
3928 (i.e., columns) in the input lines.  This command looks like it might be
3929 useful for data filtering.
3930
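Both modes can be tried without touching the real password file.  The
sketch below feeds @code{cut} the sample lines shown earlier by way of
@code{printf}, which is a convenience chosen for this example.

@example
# Field mode: fields 1 and 5 of a colon-separated line;
# prints arnold:Arnold D. Robbins.
printf 'arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh\n' \
  | cut -d: -f1,5

# Character mode: the first eight columns of a line of who output;
# prints arnold followed by two spaces.
printf 'arnold   console Jan 22 19:57\n' | cut -c1-8
@end example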
3931
3932 @node The sort command
3933 @unnumberedsec The @code{sort} command
3934
3935 Next we'll look at the @code{sort} command.  This is one of the most
3936 powerful commands on a Unix-style system; one that you will often find
3937 yourself using when setting up fancy data plumbing. The @code{sort}
3938 command reads and sorts each file named on the command line.  It then
3939 merges the sorted data and writes it to standard output.  It will read
3940 standard input if no files are given on the command line (thus
3941 making it into a filter).  The sort is based on the character collating
3942 sequence or based on user-supplied ordering criteria.
3943
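Here is a brief sketch of @code{sort} used as a filter, first with the
default ordering and then with two kinds of user-supplied criteria.
The input lines are made up, and the @samp{-n}, @samp{-t}, and
@samp{-k} options are not discussed in this column; check your system
documentation for the details.

@example
# Default collating order; prints arnold, bill, miriam.
printf 'miriam\narnold\nbill\n' | sort

# Numeric order; prints 9, 10, 100.
printf '10\n9\n100\n' | sort -n

# Numeric order on the second colon-separated field;
# prints c:1, b:2, a:10.
printf 'b:2\na:10\nc:1\n' | sort -t : -k 2,2n
@end example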
3944
3945 @node The uniq command
3946 @unnumberedsec The @code{uniq} command
3947
3948 Finally (at least for now), we'll look at the @code{uniq} program.  When
3949 sorting data, you will often end up with duplicate lines, lines that
3950 are identical.  Usually, all you need is one instance of each line.
3951 This is where @code{uniq} comes in. The @code{uniq} program reads its
3952 standard input, which it expects to be sorted.  It only prints out one
3953 copy of each duplicated line.  It does have several options.  Later on,
3954 we'll use the @samp{-c} option, which prints each unique line, preceded
3955 by a count of the number of times that line occurred in the input.
3956
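A short sketch, using already sorted, made-up input, shows both the
plain behavior and the @samp{-c} option.  The exact width of the count
column may differ between implementations, so the comments describe
the output with the leading blanks left out.

@example
# One copy of each line; prints arnold, bill, miriam.
printf 'arnold\narnold\nbill\nmiriam\n' | uniq

# Each line preceded by its count (right-aligned with leading
# blanks); prints 2 arnold, 1 bill, 1 miriam.
printf 'arnold\narnold\nbill\nmiriam\n' | uniq -c
@end example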
3957
3958 @node Putting the tools together
3959 @unnumberedsec Putting the tools together
3960
3961 Now, let's suppose this is a large BBS system with dozens of users
3962 logged in.  The management wants the SysOp to write a program that will
3963 generate a sorted list of logged in users.  Furthermore, even if a user
3964 is logged in multiple times, his or her name should only show up in the
3965 output once.
3966
3967 The SysOp could sit down with the system documentation and write a C
3968 program that did this. It would take perhaps a couple of hundred lines
3969 of code and about two hours to write it, test it, and debug it.
3970 However, knowing the software toolbox, the SysOp can instead start out
3971 by generating just a list of logged on users:
3972
3973 @example
3974 $ who | cut -c1-8
3975 arnold
3976 miriam
3977 bill
3978 arnold
3979 @end example
3980
3981 Next, sort the list:
3982
3983 @example
3984 $ who | cut -c1-8 | sort
3985 arnold
3986 arnold
3987 bill
3988 miriam
3989 @end example
3990
3991 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
3992
3993 @example
3994 $ who | cut -c1-8 | sort | uniq
3995 arnold
3996 bill
3997 miriam
3998 @end example
3999
4000 The @code{sort} command actually has a @samp{-u} option that does what
4001 @code{uniq} does. However, @code{uniq} has other uses for which one
4002 cannot substitute @samp{sort -u}.
4003
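To make the comparison concrete, the sketch below replaces @code{who}
with a fixed list of names, an assumption made so that the result does
not depend on who is logged in.

@example
# These two pipelines print the same thing: arnold, bill, miriam.
printf 'arnold\nmiriam\nbill\narnold\n' | sort | uniq
printf 'arnold\nmiriam\nbill\narnold\n' | sort -u

# uniq -d prints only the lines that were repeated, which sort -u
# cannot do; prints arnold.
printf 'arnold\nmiriam\nbill\narnold\n' | sort | uniq -d
@end example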
4004 The SysOp puts this pipeline into a shell script, and makes it available for
4005 all the users on the system:
4006
4007 @example
4008 # cat > /usr/local/bin/listusers
4009 who | cut -c1-8 | sort | uniq
4010 ^D
4011 # chmod +x /usr/local/bin/listusers
4012 @end example
4013
4014 There are four major points to note here.  First, with just four
4015 programs, on one command line, the SysOp was able to save about two
hours' worth of work.  Furthermore, the shell pipeline is just about as
4017 efficient as the C program would be, and it is much more efficient in
4018 terms of programmer time.  People time is much more expensive than
4019 computer time, and in our modern ``there's never enough time to do
4020 everything'' society, saving two hours of programmer time is no mean
4021 feat.
4022
4023 Second, it is also important to emphasize that with the
4024 @emph{combination} of the tools, it is possible to do a special
4025 purpose job never imagined by the authors of the individual programs.
4026
4027 Third, it is also valuable to build up your pipeline in stages, as we did here.
4028 This allows you to view the data at each stage in the pipeline, which helps
4029 you acquire the confidence that you are indeed using these tools correctly.
4030
4031 Finally, by bundling the pipeline in a shell script, other users can use
4032 your command, without having to remember the fancy plumbing you set up for
4033 them. In terms of how you run them, shell scripts and compiled programs are
4034 indistinguishable.

After the previous warm-up exercise, we'll look at two additional, more
complicated pipelines.  For them, we need to introduce two more tools.

The first is the @code{tr} command, which stands for ``transliterate.''
The @code{tr} command works on a character-by-character basis, changing
characters. Normally it is used for things like mapping upper case to
lower case:

@example
$ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
this example has mixed case!
@end example

There are several options of interest:

@table @samp
@item -c
work on the complement of the listed characters, i.e.,
operations apply to characters not in the given set

@item -d
delete characters in the first set from the output

@item -s
squeeze repeated characters in the output into just one character
@end table
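
As a quick sketch of each option in isolation (the sample strings are
made up for illustration):

@example
# -d: delete every l and o.
deleted=$(printf 'hello, world!\n' | tr -d 'lo')
# -c with -d: delete everything except lower-case letters and newline.
letters=$(printf 'hello, world!\n' | tr -cd 'a-z\012')
# -s: squeeze each run of blanks down to a single blank.
squeezed=$(printf 'a    b  c\n' | tr -s ' ')
printf '%s\n' "$deleted" "$letters" "$squeezed"
@end example

This prints @samp{he, wrd!}, @samp{helloworld}, and @samp{a b c}.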

We will be using all three options in a moment.

The other command we'll look at is @code{comm}.  The @code{comm}
command takes two sorted files as input, and prints out the
files' lines in three columns.  The output columns are the data lines
unique to the first file, the data lines unique to the second file, and
the data lines that are common to both.  The @samp{-1}, @samp{-2}, and
@samp{-3} command line options omit the respective columns. (This is
non-intuitive and takes a little getting used to.)  For example:

@example
$ cat f1
11111
22222
33333
44444
$ cat f2
00000
22222
33333
55555
$ comm f1 f2
        00000
11111
                22222
                33333
44444
        55555
@end example

A single dash given as a filename tells @code{comm} to read standard
input instead of a regular file.
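
Here is a sketch of the column options and of the dash, using the same
two files recreated in a scratch directory (the directory and variable
names are ours, chosen for illustration):

@example
dir=$(mktemp -d)
printf '%s\n' 11111 22222 33333 44444 > "$dir/f1"
printf '%s\n' 00000 22222 33333 55555 > "$dir/f2"

# Suppress columns 1 and 2: only the lines common to both files remain.
common=$(comm -12 "$dir/f1" "$dir/f2")
# Suppress columns 2 and 3, reading the first file from standard input.
only_f1=$(comm -23 - "$dir/f2" < "$dir/f1")
printf '%s\n' "$common" "$only_f1"
@end example

The output is @samp{22222} and @samp{33333}, followed by @samp{11111}
and @samp{44444}.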

Now we're ready to build a fancy pipeline.  The first application is a word
frequency counter.  This helps an author determine if he or she is overusing
certain words.

The first step is to change the case of all the letters in our input file
to one case.  ``The'' and ``the'' are the same word when counting.

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | ...
@end example

The next step is to get rid of punctuation.  Quoted words and unquoted words
should be treated identically; it's easiest to just get the punctuation out of
the way.

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
@end example

The second @code{tr} command operates on the complement of the listed
characters, which are all the letters, the digits, the underscore, and
the blank.  The @samp{\012} represents the newline character; it has to
be left alone.  (The @sc{ascii} tab character should also be included for
good measure in a production script.)
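
A sketch of that production variant, assuming the tab is written as the
octal escape @samp{\011} in the same way the newline is written as
@samp{\012} (the input line is made up):

@example
# Keep letters, digits, underscore, blank, tab (\011), newline (\012).
cleaned=$(printf 'so-called\t"quoted" words.\n' |
    tr -cd '[A-Za-z0-9_ \011\012]')
printf '%s\n' "$cleaned"
@end example

Note that deleting the hyphen joins @samp{so-called} into the single
word @samp{socalled}, while the tab survives as a separator.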

At this point, we have data consisting of words separated by blank space.
The words only contain alphanumeric characters (and the underscore).  The
next step is to break the data apart so that we have one word per line. This
makes the counting operation much easier, as we will see shortly.

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | ...
@end example

This command turns blanks into newlines.  The @samp{-s} option squeezes
multiple newline characters in the output into just one.  This helps us
avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
This is what the shell prints when it notices you haven't finished
typing in all of a command.)
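
To see what @samp{-s} buys us, this sketch counts the lines produced
with and without it for a made-up line containing runs of blanks:

@example
line='the  quick   brown fox'
# Without -s, every blank becomes its own newline, leaving empty lines.
unsqueezed=$(printf '%s\n' "$line" | tr ' ' '\012' | wc -l | tr -d ' ')
# With -s, each run of newlines collapses into one.
words=$(printf '%s\n' "$line" | tr -s '[ ]' '\012')
squeezed=$(printf '%s\n' "$words" | wc -l | tr -d ' ')
printf '%s lines without -s, %s with\n' "$unsqueezed" "$squeezed"
@end example

This reports 7 lines without @samp{-s} and 4 with it.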

We now have data consisting of one word per line, no punctuation, all one
case.  We're ready to count each word:

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort | uniq -c | ...
@end example

At this point, the data might look something like this:

@example
  60 a
   2 able
   6 about
   1 above
   2 accomplish
   1 acquire
   1 actually
   2 additional
@end example

The output is sorted by word, not by count!  What we want is the most
frequently used words first.  Fortunately, this is easy to accomplish,
with the help of two more @code{sort} options:

@table @samp
@item -n
do a numeric sort, not a textual one

@item -r
reverse the order of the sort
@end table
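
A small sketch of the difference, using made-up counts written without
padding; @samp{LC_ALL=C} is set here only to make the textual ordering
predictable, and is our assumption rather than part of the pipeline:

@example
counts='9 a
10 b
100 c'
# Textual sort compares character by character, so 10 sorts before 9.
by_text=$(printf '%s\n' "$counts" | LC_ALL=C sort)
# Numeric and reversed: the largest count comes first.
by_count=$(printf '%s\n' "$counts" | LC_ALL=C sort -nr)
printf '%s\n' "$by_text" "$by_count"
@end example

The textual sort yields 10, 100, 9; the numeric, reversed sort yields
100, 10, 9.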

The final pipeline looks like this:

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
 156 the
  60 a
  58 to
  51 of
  51 and
 ...
@end example

Whew!  That's a lot to digest.  Yet, the same principles apply. With six
commands, on two lines (really one long one split for convenience), we've
created a program that does something interesting and useful, in much
less time than we could have written a C program to do the same thing.
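
To try the whole pipeline without a copy of @file{whats.gnu}, here is a
sketch that feeds it two made-up sentences instead.  The use of
@code{head} to pick out the first line is our addition, not part of the
pipeline above:

@example
text='The cat saw the dog.
The dog saw "the" cat!'
freq=$(printf '%s\n' "$text" |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
    tr -s '[ ]' '\012' | sort | uniq -c | sort -nr)
# The most frequent word is on the first line of the result.
top=$(printf '%s\n' "$freq" | head -n 1)
printf '%s\n' "$top"
@end example

The word @samp{the} occurs four times in the sample, so it is reported
first, with a count of 4.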

A minor modification to the above pipeline can give us a simple spelling
checker!  To determine if you've spelled a word correctly, all you have to
do is look it up in a dictionary.  If it is not there, then chances are
that your spelling is incorrect.  So, we need a dictionary.  If you
have the Slackware Linux distribution, you have the file
@file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400-word
dictionary.

Now, how to compare our file with the dictionary?  As before, we generate
a sorted list of words, one per line:

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort -u | ...
@end example

Now, all we need is a list of words that are @emph{not} in the
dictionary.  Here is where the @code{comm} command comes in.

@example
$ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
> tr -s '[ ]' '\012' | sort -u |
> comm -23 - /usr/lib/ispell/ispell.words
@end example

The @samp{-2} and @samp{-3} options eliminate lines that are only in the
dictionary (the second file), and lines that are in both files.  Lines
only in the first file (standard input, our stream of words), are
words that are not in the dictionary.  These are likely candidates for
spelling errors.  This pipeline was the first cut at a production
spelling checker on Unix.
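
The same idea can be tried on a small scale.  This sketch assumes a
hypothetical four-word dictionary in a scratch file in place of
@file{ispell.words}, and sets @samp{LC_ALL=C} so that @code{sort} and
@code{comm} agree on the ordering:

@example
dir=$(mktemp -d)
# Hypothetical four-word dictionary; comm needs it to be sorted.
printf '%s\n' a is sentence this > "$dir/dict"
misspelled=$(printf 'This is a sentance.\n' |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
    tr -s '[ ]' '\012' | LC_ALL=C sort -u |
    LC_ALL=C comm -23 - "$dir/dict")
printf '%s\n' "$misspelled"
@end example

The only word missing from the dictionary is @samp{sentance}, so that
is the one line printed.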

There are some other tools that deserve brief mention.

@table @code
@item grep
search files for text that matches a regular expression

@item egrep
like @code{grep}, but with more powerful regular expressions

@item wc
count lines, words, and characters

@item tee
a T-fitting for data pipes; copies data to files and to standard output

@item sed
the stream editor, an advanced tool

@item awk
a data manipulation language, another advanced tool
@end table
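
As a brief sketch of three of these, run on a made-up three-line file
(the file names are ours, chosen for illustration):

@example
dir=$(mktemp -d)
printf '%s\n' 'one fish' 'two fish' 'red bird' > "$dir/sample"

# grep: keep only the lines that match a regular expression.
matches=$(grep 'fish$' "$dir/sample")
# wc: count the lines; tr strips any padding around the number.
lines=$(wc -l < "$dir/sample" | tr -d ' ')
# tee: save a copy of the sorted data while passing it along.
sorted=$(sort "$dir/sample" | tee "$dir/sorted.copy")
printf '%s\n' "$matches" "$lines" "$sorted"
@end example

After this runs, @file{sorted.copy} holds the same three sorted lines
that went on to standard output.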

The software tools philosophy also espoused the following bit of
advice: ``Let someone else do the hard part.'' This means, take
something that gives you most of what you need, and then massage it the
rest of the way until it's in the form that you want.

To summarize:

@enumerate 1
@item
Each program should do one thing well. No more, no less.

@item
Combining programs with appropriate plumbing leads to results where
the whole is greater than the sum of the parts.  It also leads to novel
uses of programs that the authors might never have imagined.

@item
Programs should never print extraneous header or trailer data, since these
could get sent on down a pipeline. (A point we didn't mention earlier.)

@item
Let someone else do the hard part.

@item
Know your toolbox! Use each program appropriately. If you don't have an
appropriate tool, build one.
@end enumerate
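
The point about headers can be illustrated with a sketch.  The report
below is hypothetical, and stripping the header with @code{sed} is just
one way a consumer might cope:

@example
# Hypothetical report that starts with a header line.
report='COUNT WORD
3 the
1 a'
# Downstream, the header is counted as if it were a data line.
with_header=$(printf '%s\n' "$report" | wc -l | tr -d ' ')
# The consumer must strip it first, here with sed.
data_only=$(printf '%s\n' "$report" | sed 1d | wc -l | tr -d ' ')
printf '%s %s\n' "$with_header" "$data_only"
@end example

This prints @samp{3 2}: the header inflates the count unless every
consumer remembers to remove it.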

As of this writing, all the programs we've discussed are available via
anonymous @code{ftp} from @code{prep.ai.mit.edu} as
@file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was current
when this column was written. Check the nearest GNU archive for the
current version.  The main GNU FTP site is now @code{ftp.gnu.org}.}

None of what I have presented in this column is new. The Software Tools
philosophy was first introduced in the book @cite{Software Tools},
by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
0-201-03669-X).  This book showed how to write and use software
tools.  It was written in 1976, using a preprocessor for FORTRAN named
@code{ratfor} (RATional FORtran).  At the time, C was not as ubiquitous
as it is now; FORTRAN was.  The last chapter presented a @code{ratfor}
to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
awful lot like C; if you know C, you won't have any problem following
the code.

In 1981, the book was updated and made available as @cite{Software
Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7).  Both books
remain in print, and are well worth reading if you're a programmer.
They certainly made a major change in how I view programming.

Initially, the programs in both books were available (on 9-track tape)
from Addison-Wesley.  Unfortunately, this is no longer the case,
although you might be able to find copies floating around the Internet.
For a number of years, there was an active Software Tools Users Group,
whose members had ported the original @code{ratfor} programs to essentially
every computer system with a FORTRAN compiler.  The popularity of the
group waned in the mid-1980s as Unix began to spread beyond universities.

With the current proliferation of GNU code and other clones of Unix programs,
these programs now receive little attention; modern C versions are
much more efficient and do more than these programs do.  Nevertheless, as
an exposition of good programming style, and as evangelism for a
still-valuable philosophy, these books are unparalleled, and I recommend
them highly.

Acknowledgment: I would like to express my gratitude to Brian Kernighan
of Bell Labs, the original Software Toolsmith, for reviewing this column.


@node Index
@unnumbered Index

@printindex cp

@contents
@bye

@c Local variables:
@c texinfo-column-for-description: 32
@c End: