3 @setfilename textutils.info
4 @settitle @sc{gnu} text utilities
8 @include constants.texi
10 @c Define new indices.
13 @c Put everything in one index (arbitrarily chosen to be the concept index).
23 * Text utilities: (textutils). GNU text utilities.
24 * cat: (textutils)cat invocation. Concatenate and write files.
25 * cksum: (textutils)cksum invocation. Print @sc{posix} CRC checksum.
26 * comm: (textutils)comm invocation. Compare sorted files by line.
27 * csplit: (textutils)csplit invocation. Split by context.
28 * cut: (textutils)cut invocation. Print selected parts of lines.
29 * expand: (textutils)expand invocation. Convert tabs to spaces.
30 * fmt: (textutils)fmt invocation. Reformat paragraph text.
31 * fold: (textutils)fold invocation. Wrap long input lines.
32 * head: (textutils)head invocation. Output the first part of files.
33 * join: (textutils)join invocation. Join lines on a common field.
34 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
35 * nl: (textutils)nl invocation. Number lines and write files.
36 * od: (textutils)od invocation. Dump files in octal, etc.
37 * paste: (textutils)paste invocation. Merge lines of files.
38 * pr: (textutils)pr invocation. Paginate or columnate files.
39 * ptx: (textutils)ptx invocation. Produce permuted indexes.
40 * sort: (textutils)sort invocation. Sort text files.
41 * split: (textutils)split invocation. Split into fixed-size pieces.
42 * sum: (textutils)sum invocation. Print traditional checksum.
43 * tac: (textutils)tac invocation. Reverse files.
44 * tail: (textutils)tail invocation. Output the last part of files.
45 * tsort: (textutils)tsort invocation. Topological sort.
46 * tr: (textutils)tr invocation. Translate characters.
47 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
48 * uniq: (textutils)uniq invocation. Uniquify files.
49 * wc: (textutils)wc invocation. Byte, word, and line counts.
55 This file documents the GNU text utilities.
57 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
59 Permission is granted to copy, distribute and/or modify this document
60 under the terms of the GNU Free Documentation License, Version 1.1
61 or any later version published by the Free Software Foundation;
62 with no Invariant Sections, with no
63 Front-Cover Texts, and with no Back-Cover Texts.
64 A copy of the license is included in the section entitled ``GNU
65 Free Documentation License''.
70 @title @sc{gnu} @code{textutils}
71 @subtitle A set of text utilities
72 @subtitle for version @value{VERSION}, @value{UPDATED}
73 @author David MacKenzie et al.
76 @vskip 0pt plus 1filll
77 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
79 Permission is granted to copy, distribute and/or modify this document
80 under the terms of the GNU Free Documentation License, Version 1.1
81 or any later version published by the Free Software Foundation;
82 with no Invariant Sections, with no
83 Front-Cover Texts, and with no Back-Cover Texts.
84 A copy of the license is included in the section entitled ``GNU
85 Free Documentation License''.
89 @c If your makeinfo doesn't grok this @ifnottex directive, then either
90 @c get a newer version of makeinfo or do s/ifnottex/ifinfo/ here and on
91 @c the matching @end directive below.
94 @top GNU text utilities
96 @cindex text utilities
97 @cindex utilities for text handling
99 This manual documents version @value{VERSION} of the @sc{gnu} text utilities.
102 * Introduction:: Caveats, overview, and authors.
103 * Common options:: Common options.
104 * Output of entire files:: cat tac nl od
105 * Formatting file contents:: fmt pr fold
106 * Output of parts of files:: head tail split csplit
107 * Summarizing files:: wc sum cksum md5sum
108 * Operating on sorted files:: sort uniq comm ptx tsort
109 * Operating on fields within a line:: cut paste join
110 * Operating on characters:: tr expand unexpand
111 * Opening the software toolbox:: The software tools philosophy.
112 * Index:: General index.
115 --- The Detailed Node Listing ---
117 Output of entire files
119 * cat invocation:: Concatenate and write files.
120 * tac invocation:: Concatenate and write files in reverse.
121 * nl invocation:: Number lines and write files.
122 * od invocation:: Write files in octal or other formats.
124 Formatting file contents
126 * fmt invocation:: Reformat paragraph text.
127 * pr invocation:: Paginate or columnate files for printing.
128 * fold invocation:: Wrap input lines to fit in specified width.
130 Output of parts of files
132 * head invocation:: Output the first part of files.
133 * tail invocation:: Output the last part of files.
134 * split invocation:: Split a file into fixed-size pieces.
135 * csplit invocation:: Split a file into context-determined pieces.
139 * wc invocation:: Print byte, word, and line counts.
140 * sum invocation:: Print checksum and block counts.
141 * cksum invocation:: Print CRC checksum and byte counts.
142 * md5sum invocation:: Print or check message-digests.
144 Operating on sorted files
146 * sort invocation:: Sort text files.
147 * uniq invocation:: Uniquify files.
148 * comm invocation:: Compare two sorted files line by line.
149 * ptx invocation:: Produce a permuted index of file contents.
150 * tsort invocation:: Topological sort.
152 @code{ptx}: Produce permuted indexes
154 * General options in ptx:: Options which affect general program behavior.
155 * Charset selection in ptx:: Underlying character set considerations.
156 * Input processing in ptx:: Input fields, contexts, and keyword selection.
157 * Output formatting in ptx:: Types of output format, and sizing the fields.
158 * Compatibility in ptx:: The GNU extensions to @code{ptx}
160 Operating on fields within a line
162 * cut invocation:: Print selected parts of lines.
163 * paste invocation:: Merge lines of files.
164 * join invocation:: Join lines on a common field.
166 Operating on characters
168 * tr invocation:: Translate, squeeze, and/or delete characters.
169 * expand invocation:: Convert tabs to spaces.
170 * unexpand invocation:: Convert spaces to tabs.
172 @code{tr}: Translate, squeeze, and/or delete characters
174 * Character sets:: Specifying sets of characters.
* Translating:: Changing one character to another.
176 * Squeezing:: Squeezing repeats and deleting.
177 * Warnings in tr:: Warning messages.
179 Opening the software toolbox
181 * Toolbox introduction:: Toolbox introduction
182 * I/O redirection:: I/O redirection
183 * The who command:: The @code{who} command
184 * The cut command:: The @code{cut} command
185 * The sort command:: The @code{sort} command
186 * The uniq command:: The @code{uniq} command
187 * Putting the tools together:: Putting the tools together
196 @chapter Introduction
200 This manual is incomplete: No attempt is made to explain basic concepts
201 in a way suitable for novices. Thus, if you are interested, please get
202 involved in improving this manual. The entire @sc{gnu} community will
206 The @sc{gnu} text utilities are mostly compatible with the @sc{posix.2}
209 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
210 @c sh-utils.texi too -- so be sure to keep them consistent.
211 @cindex bugs, reporting
212 Please report bugs to @email{bug-textutils@@gnu.org}. Remember
213 to include the version number, machine architecture, input files, and
214 any other information needed to reproduce the bug: your input, what you
215 expected, what you got, and why it is wrong. Diffs are welcome, but
216 please include a description of the problem as well, since this is
217 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
219 This manual was originally derived from the Unix man pages in the
220 distribution, which were written by David MacKenzie and updated by Jim
221 Meyering. What you are reading now is the authoritative documentation
222 for these utilities; the man pages are no longer being maintained.
223 The original @code{fmt} man page was written by Ross Paterson.
224 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
225 Karl Berry did the indexing, some reorganization, and editing of the results.
226 Richard Stallman contributed his usual invaluable insights to the
231 @chapter Common options
233 @cindex common options
235 Certain options are available in all of these programs. Rather than
236 writing identical descriptions for each of the programs, they are
237 described here. (In fact, every @sc{gnu} program accepts (or should accept)
240 Some of these programs recognize the @samp{--help} and @samp{--version}
241 options only when one of them is the sole command line argument.
248 Print a usage message listing all available options, then exit successfully.
252 @cindex version number, finding
253 Print the version number, then exit successfully.
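
As a quick check of these two options, the following sketch runs each of
them as the sole argument of @code{cat}; any other program described in
this manual could be substituted.  It assumes that the @sc{gnu}
implementation is the one found first in the search path.

```shell
# Ask for the usage message; it is written to standard output.
cat --help > /dev/null
help_status=$?

# Ask for the version number.
cat --version > /dev/null
version_status=$?

# Both runs are expected to exit successfully (status 0).
echo "help: $help_status, version: $version_status"
```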
258 @node Output of entire files
259 @chapter Output of entire files
261 @cindex output of entire files
262 @cindex entire files, output of
264 These commands read and write entire files, possibly transforming them
268 * cat invocation:: Concatenate and write files.
269 * tac invocation:: Concatenate and write files in reverse.
270 * nl invocation:: Number lines and write files.
271 * od invocation:: Write files in octal or other formats.
275 @section @code{cat}: Concatenate and write files
278 @cindex concatenate and write files
279 @cindex copying files
281 @code{cat} copies each @var{file} (@samp{-} means standard input), or
282 standard input if none are given, to standard output. Synopsis:
cat [@var{option}]@dots{} [@var{file}]@dots{}
288 The program accepts the following options. Also see @ref{Common options}.
296 Equivalent to @samp{-vET}.
302 @cindex binary and text I/O in cat
303 On MS-DOS and MS-Windows only, read and write the files in binary mode.
304 By default, @code{cat} on MS-DOS/MS-Windows uses binary mode only when
305 standard output is redirected to a file or a pipe; this option overrides
306 that. Binary file I/O is used so that the files retain their format
307 (Unix text as opposed to DOS text and binary), because @code{cat} is
308 frequently used as a file-copying program. Some options (see below)
309 cause @code{cat} to read and write files in text mode because in those
310 cases the original file contents aren't important (e.g., when lines are
311 numbered by @code{cat}, or when line endings should be marked). This is
312 so these options work as DOS/Windows users would expect; for example,
313 DOS-style text files have their lines end with the CR-LF pair of
314 characters, which won't be processed as an empty line by @samp{-b} unless
315 the file is read in text mode.
318 @itemx --number-nonblank
320 @opindex --number-nonblank
321 Number all nonblank output lines, starting with 1. On MS-DOS and
322 MS-Windows, this option causes @code{cat} to read and write files in
327 Equivalent to @samp{-vE}.
333 Display a @samp{$} after the end of each line. On MS-DOS and
334 MS-Windows, this option causes @code{cat} to read and write files in
341 Number all output lines, starting with 1. On MS-DOS and MS-Windows,
342 this option causes @code{cat} to read and write files in text mode.
345 @itemx --squeeze-blank
347 @opindex --squeeze-blank
348 @cindex squeezing blank lines
349 Replace multiple adjacent blank lines with a single blank line. On
350 MS-DOS and MS-Windows, this option causes @code{cat} to read and write
355 Equivalent to @samp{-vT}.
361 Display TAB characters as @samp{^I}.
365 Ignored; for Unix compatibility.
368 @itemx --show-nonprinting
370 @opindex --show-nonprinting
371 Display control characters except for LFD and TAB using
372 @samp{^} notation and precede characters that have the high bit set with
373 @samp{M-}. On MS-DOS and MS-Windows, this option causes @code{cat} to
374 read files and standard input in DOS binary mode, so the CR
375 characters at the end of each line are also visible.
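
The options above can be tried on a small made-up input.  This is only a
sketch: the sample text is hypothetical, and it assumes a Unix-like system
where text and binary modes do not differ.

```shell
# -A marks each line end with $ and shows the TAB as ^I.
marked=$(printf 'one\ttwo\n\n\nthree\n' | cat -A)
printf '%s\n' "$marked"

# -s squeezes the two empty lines into one; -b numbers only nonblank lines.
numbered=$(printf 'one\ttwo\n\n\nthree\n' | cat -s -b)
printf '%s\n' "$numbered"
```

Combining @samp{-s} with @samp{-b} is a convenient way to number the
lines of a listing while keeping at most one empty line between blocks.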
381 @section @code{tac}: Concatenate and write files in reverse
384 @cindex reversing files
386 @code{tac} copies each @var{file} (@samp{-} means standard input), or
387 standard input if none are given, to standard output, reversing the
388 records (lines by default) in each separately. Synopsis:
391 tac [@var{option}]@dots{} [@var{file}]@dots{}
394 @dfn{Records} are separated by instances of a string (newline by
395 default). By default, this separator string is attached to the end of
396 the record that it follows in the file.
398 The program accepts the following options. Also see @ref{Common options}.
406 The separator is attached to the beginning of the record that it
407 precedes in the file.
413 Treat the separator string as a regular expression. Users of @code{tac}
414 on MS-DOS/MS-Windows should note that, since @code{tac} reads files in
415 binary mode, each line of a text file might end with a CR/LF pair
416 instead of the Unix-style LF.
418 @item -s @var{separator}
419 @itemx --separator=@var{separator}
422 Use @var{separator} as the record separator, instead of newline.
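
A minimal sketch of the default behavior and of @samp{-s}, using made-up
input; the comma separator is chosen only for illustration.

```shell
# Default: records are lines, written last to first.
reversed=$(printf 'one\ntwo\nthree\n' | tac)
printf '%s\n' "$reversed"

# -s: use a comma as the separator; it stays attached to the end of
# the record that it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)
printf '%s\n' "$by_comma"
```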
428 @section @code{nl}: Number lines and write files
431 @cindex numbering lines
432 @cindex line numbering
434 @code{nl} writes each @var{file} (@samp{-} means standard input), or
435 standard input if none are given, to standard output, with line numbers
436 added to some or all of the lines. Synopsis:
439 nl [@var{option}]@dots{} [@var{file}]@dots{}
442 @cindex logical pages, numbering on
443 @code{nl} decomposes its input into (logical) pages; by default, the
444 line number is reset to 1 at the top of each logical page. @code{nl}
445 treats all of the input files as a single document; it does not reset
446 line numbers or logical pages between files.
448 @cindex headers, numbering
449 @cindex body, numbering
450 @cindex footers, numbering
451 A logical page consists of three sections: header, body, and footer.
452 Any of the sections can be empty. Each can be numbered in a different
453 style from the others.
455 The beginnings of the sections of logical pages are indicated in the
456 input file by a line containing exactly one of these delimiter strings:
467 The two characters from which these strings are made can be changed from
468 @samp{\} and @samp{:} via options (see below), but the pattern and
469 length of each string cannot be changed.
471 A section delimiter is replaced by an empty line on output. Any text
472 that comes before the first section delimiter string in the input file
473 is considered to be part of a body section, so @code{nl} treats a
474 file that contains no section delimiters as a single body section.
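
The following sketch builds one logical page by hand.  It assumes the
default delimiter characters, so that @samp{\:\:\:} starts a header,
@samp{\:\:} a body, and @samp{\:} a footer; the text lines themselves are
made up.

```shell
# One logical page: header, body, and footer, each introduced by a
# delimiter line.  Each delimiter line becomes an empty line on output.
page=$(printf '%s\n' '\:\:\:' 'Title' '\:\:' 'first' 'second' '\:' 'The end' | nl)
printf '%s\n' "$page"

# With the default styles only the body lines receive a number.
numbered_count=$(printf '%s\n' "$page" | grep -c '^ *[0-9]')
echo "numbered lines: $numbered_count"
```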
476 The program accepts the following options. Also see @ref{Common options}.
481 @itemx --body-numbering=@var{style}
483 @opindex --body-numbering
484 Select the numbering style for lines in the body section of each
485 logical page. When a line is not numbered, the current line number
486 is not incremented, but the line number separator character is still
487 prepended to the line. The styles are:
493 number only nonempty lines (default for body),
495 do not number lines (default for header and footer),
497 number only lines that contain a match for @var{regexp}.
501 @itemx --section-delimiter=@var{cd}
503 @opindex --section-delimiter
504 @cindex section delimiters of pages
505 Set the section delimiter characters to @var{cd}; default is
506 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
507 (Remember to protect @samp{\} or other metacharacters from shell
508 expansion with quotes or extra backslashes.)
511 @itemx --footer-numbering=@var{style}
513 @opindex --footer-numbering
514 Analogous to @samp{--body-numbering}.
517 @itemx --header-numbering=@var{style}
519 @opindex --header-numbering
520 Analogous to @samp{--body-numbering}.
522 @item -i @var{number}
523 @itemx --page-increment=@var{number}
525 @opindex --page-increment
526 Increment line numbers by @var{number} (default 1).
528 @item -l @var{number}
529 @itemx --join-blank-lines=@var{number}
531 @opindex --join-blank-lines
532 @cindex empty lines, numbering
533 @cindex blank lines, numbering
534 Consider @var{number} (default 1) consecutive empty lines to be one
535 logical line for numbering, and only number the last one. Where fewer
536 than @var{number} consecutive empty lines occur, do not number them.
537 An empty line is one that contains no characters, not even spaces
540 @item -n @var{format}
541 @itemx --number-format=@var{format}
543 @opindex --number-format
544 Select the line numbering format (default is @code{rn}):
548 @opindex ln @r{format for @code{nl}}
549 left justified, no leading zeros;
551 @opindex rn @r{format for @code{nl}}
552 right justified, no leading zeros;
554 @opindex rz @r{format for @code{nl}}
555 right justified, leading zeros.
561 @opindex --no-renumber
562 Do not reset the line number at the start of a logical page.
564 @item -s @var{string}
565 @itemx --number-separator=@var{string}
567 @opindex --number-separator
568 Separate the line number from the text line in the output with
569 @var{string} (default is the TAB character).
571 @item -v @var{number}
572 @itemx --starting-line-number=@var{number}
574 @opindex --starting-line-number
575 Set the initial line number on each logical page to @var{number} (default 1).
577 @item -w @var{number}
578 @itemx --number-width=@var{number}
580 @opindex --number-width
581 Use @var{number} characters for line numbers (default 6).
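
Several of these options are commonly combined.  The sketch below uses
made-up input and only the short option forms described above.

```shell
# Number every line (-b a), including the empty one, as three
# zero-padded digits (-n rz -w 3) followed by ": " instead of a TAB (-s).
all_lines=$(printf 'alpha\n\nbeta\n' | nl -b a -n rz -w 3 -s ': ')
printf '%s\n' "$all_lines"

# Start at 10 (-v) and step by 10 (-i), BASIC style.
stepped=$(printf 'alpha\nbeta\ngamma\n' | nl -v 10 -i 10 -w 2 -s ' ')
printf '%s\n' "$stepped"
```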
587 @section @code{od}: Write files in octal or other formats
590 @cindex octal dump of files
591 @cindex hex dump of files
592 @cindex ASCII dump of files
593 @cindex file contents, dumping unambiguously
595 @code{od} writes an unambiguous representation of each @var{file}
596 (@samp{-} means standard input), or standard input if none are given.
600 od [@var{option}]@dots{} [@var{file}]@dots{}
601 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
604 Each line of output consists of the offset in the input, followed by
605 groups of data from the file. By default, @code{od} prints the offset in
606 octal, and each group of file data is two bytes of input printed as a
609 The program accepts the following options. Also see @ref{Common options}.
614 @itemx --address-radix=@var{radix}
616 @opindex --address-radix
617 @cindex radix for file offsets
618 @cindex file offset radix
619 Select the base in which file offsets are printed. @var{radix} can
620 be one of the following:
630 none (do not print offsets).
633 The default is octal.
636 @itemx --skip-bytes=@var{bytes}
638 @opindex --skip-bytes
639 Skip @var{bytes} input bytes before formatting and writing. If
640 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
641 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
642 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
643 by 1024, and @samp{m} by 1048576.
646 @itemx --read-bytes=@var{bytes}
648 @opindex --read-bytes
Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
@var{bytes} are interpreted as for the @samp{-j} option.
653 @itemx --strings[=@var{n}]
656 @cindex string constants, outputting
657 Instead of the normal output, output only @dfn{string constants}: at
658 least @var{n} (3 by default) consecutive @sc{ascii} graphic characters,
659 followed by a null (zero) byte.
662 @itemx --format=@var{type}
665 Select the format in which to output the file data. @var{type} is a
666 string of one or more of the below type indicator characters. If you
667 include more than one type indicator character in a single @var{type}
668 string, or use this option more than once, @code{od} writes one copy
669 of each output line using each of the data types that you specified,
670 in the order that you specified.
672 Adding a trailing ``z'' to any type specification appends a display
673 of the @sc{ascii} character representation of the printable characters
674 to the output line generated by the type specification.
680 @sc{ascii} character or backslash escape,
The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
@samp{ }, @samp{\n}, and @samp{\0}, respectively.
Except for types @samp{a} and @samp{c}, you can specify the number
of bytes to use in interpreting each number in the given data type
by following the type indicator character with a decimal integer.
Alternatively, you can specify the size of one of the C compiler's
built-in data types by following the type indicator character with
one of the following characters. For integers (@samp{d}, @samp{o},
717 For floating point (@code{f}):
729 @itemx --output-duplicates
731 @opindex --output-duplicates
732 Output consecutive lines that are identical. By default, when two or
733 more consecutive output lines would be identical, @code{od} outputs only
734 the first line, and puts just an asterisk on the following line to
735 indicate the elision.
738 @itemx --width[=@var{n}]
Dump @var{n} input bytes per output line. This must be a multiple of
the least common multiple of the sizes associated with the specified
output types. If @var{n} is omitted, the default is 32. If this option
is not given at all, the default is 16.
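
The options described so far can be sketched on a few bytes of made-up
input.  The example assumes the type indicators @samp{x} (hexadecimal)
and @samp{c} (character) and the radix letters @samp{n} (none) and
@samp{d} (decimal), and it reads @file{/dev/zero}, which may not exist on
every system.

```shell
# Hexadecimal bytes (-t x1) with no offset column (-A n).
hex_bytes=$(printf 'AB\n' | od -A n -t x1)
echo "$hex_bytes"

# Skip two bytes (-j 2), then dump at most three (-N 3) as characters.
middle=$(printf 'abcdef' | od -A n -j 2 -N 3 -t c)
echo "$middle"

# Print offsets in decimal (-A d).  The identical second line of zeros
# is elided, and the last output line is the end offset.
end_offset=$(head -c 20 /dev/zero | od -A d -t x1 | tail -n 1)
echo "$end_offset"
```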
748 The next several options map the old, pre-@sc{posix} format specification
749 options to the corresponding @sc{posix} format specs.
750 @sc{gnu} @code{od} accepts
751 any combination of old- and new-style options. Format specification
758 Output as named characters. Equivalent to @samp{-ta}.
762 Output as octal bytes. Equivalent to @samp{-toC}.
766 Output as @sc{ascii} characters or backslash escapes. Equivalent to
771 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
775 Output as floats. Equivalent to @samp{-tfF}.
779 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
783 Output as decimal shorts. Equivalent to @samp{-td2}.
787 Output as decimal longs. Equivalent to @samp{-td4}.
791 Output as octal shorts. Equivalent to @samp{-to2}.
795 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
799 @opindex --traditional
800 Recognize the pre-@sc{posix} non-option arguments that traditional @code{od}
801 accepted. The following syntax:
804 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
can be used to specify at most one file and optional arguments
specifying an offset and a pseudo-start address, @var{label}. By
default, @var{offset} is interpreted as an octal number specifying how
many input bytes to skip before formatting and writing. The optional
trailing decimal point forces the interpretation of @var{offset} as a
decimal number. If no decimal point is specified and the offset begins
with @samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number. If
there is a trailing @samp{b}, the number of bytes skipped will be
@var{offset} multiplied by 512. The @var{label} argument is interpreted
just like @var{offset}, but it specifies an initial pseudo-address. The
pseudo-addresses are displayed in parentheses following any normal
824 @node Formatting file contents
825 @chapter Formatting file contents
827 @cindex formatting file contents
829 These commands reformat the contents of files.
832 * fmt invocation:: Reformat paragraph text.
833 * pr invocation:: Paginate or columnate files for printing.
834 * fold invocation:: Wrap input lines to fit in specified width.
839 @section @code{fmt}: Reformat paragraph text
842 @cindex reformatting paragraph text
843 @cindex paragraphs, reformatting
844 @cindex text, reformatting
846 @code{fmt} fills and joins lines to produce output lines of (at most)
847 a given number of characters (75 by default). Synopsis:
850 fmt [@var{option}]@dots{} [@var{file}]@dots{}
853 @code{fmt} reads from the specified @var{file} arguments (or standard
854 input if none are given), and writes to standard output.
856 By default, blank lines, spaces between words, and indentation are
857 preserved in the output; successive input lines with different
858 indentation are not joined; tabs are expanded on input and introduced on
861 @cindex line-breaking
862 @cindex sentences and line-breaking
863 @cindex Knuth, Donald E.
864 @cindex Plass, Michael F.
865 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
866 avoid line breaks after the first word of a sentence or before the last
867 word of a sentence. A @dfn{sentence break} is defined as either the end
868 of a paragraph or a word ending in any of @samp{.?!}, followed by two
869 spaces or end of line, ignoring any intervening parentheses or quotes.
870 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
871 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
872 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
873 and Experience}, 11 (1981), 1119--1184).
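
The default behavior described above can be observed on a few made-up
lines.  This is a sketch only; the @samp{-w} option it uses to narrow the
output is described with the other options of @code{fmt}.

```shell
# Lines with the same indentation are joined into one paragraph.
joined=$(printf 'one\ntwo\nthree\n' | fmt)
echo "$joined"

# A line with different indentation is not joined to its neighbor.
indented=$(printf 'one\n  two\n' | fmt)
echo "$indented"

# With a narrow width, no output line may exceed 20 characters.
narrow=$(printf 'the quick brown fox jumps over the lazy dog\n' | fmt -w 20)
echo "$narrow"
```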
875 The program accepts the following options. Also see @ref{Common options}.
880 @itemx --crown-margin
882 @opindex --crown-margin
884 @dfn{Crown margin} mode: preserve the indentation of the first two
885 lines within a paragraph, and align the left margin of each subsequent
886 line with that of the second line.
889 @itemx --tagged-paragraph
891 @opindex --tagged-paragraph
892 @cindex tagged paragraphs
893 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
894 indentation of the first line of a paragraph is the same as the
895 indentation of the second, the first line is treated as a one-line
901 @opindex --split-only
Split lines only. Do not join short lines to form longer ones. This
prevents sample lines of code and other such ``formatted'' text from
being unduly combined.
907 @itemx --uniform-spacing
909 @opindex --uniform-spacing
910 Uniform spacing. Reduce spacing between words to one space, and spacing
911 between sentences to two spaces.
914 @itemx -w @var{width}
915 @itemx --width=@var{width}
916 @opindex -@var{width}
919 Fill output lines up to @var{width} characters (default 75). @code{fmt}
920 initially tries to make lines about 7% shorter than this, to give it
921 room to balance line lengths.
923 @item -p @var{prefix}
924 @itemx --prefix=@var{prefix}
925 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
926 are subject to formatting. The prefix and any preceding whitespace are
927 stripped for the formatting and then re-attached to each formatted output
928 line. One use is to format certain kinds of program comments, while
929 leaving the code unchanged.
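
As a sketch of the last use, the input below mixes two comment lines with
one line of code.  The prefix @samp{#} and the line @samp{run_program}
are made up for the example.

```shell
# Only the lines starting with "#" are refilled; the prefix is removed,
# the text is joined, and the prefix is put back on the output line.
comments=$(printf '# one\n# two\nrun_program\n' | fmt -p '#')
printf '%s\n' "$comments"

# With -s, short lines are left alone rather than joined.
kept=$(printf 'one\ntwo\n' | fmt -s)
printf '%s\n' "$kept"
```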
935 @section @code{pr}: Paginate or columnate files for printing
938 @cindex printing, preparing files for
939 @cindex multicolumn output, generating
940 @cindex merging files in parallel
942 @code{pr} writes each @var{file} (@samp{-} means standard input), or
943 standard input if none are given, to standard output, paginating and
944 optionally outputting in multicolumn format; optionally merges all
945 @var{file}s, printing all in parallel, one per column. Synopsis:
948 pr [@var{option}]@dots{} [@var{file}]@dots{}
By default, a 5-line header is printed at the top of each page: two blank
lines; a line with the date, the filename, and the page count; and two more
blank lines. A footer of five blank lines is also printed. With the @samp{-F}
option, a 3-line header is printed instead (the two leading blank lines are
omitted) and no footer is used. The default @var{page_length} in both cases
is 66 lines, so the default number of text lines per page is 56 without
@samp{-F} and 63 with it. The text line of the header takes up the full
@var{page_width} in the form @samp{yyyy-mm-dd HH:MM string Page nnnn},
where @samp{string} is a centered header string.
961 Form feeds in the input cause page breaks in the output. Multiple form
962 feeds produce empty pages.
Columns are of equal width, separated by an optional string (default
is a space). For multicolumn output, lines are always truncated to
@var{page_width} (default 72), unless you use the @samp{-J} option. For
single-column output, no line truncation occurs by default; use the
@samp{-W} option to truncate lines in that case.
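
The page layout and the column behavior can be sketched as follows.  The
example assumes that a @code{seq} program is available to generate the
numbered input lines, and it uses the @samp{-t} option to omit the header
and footer; white space between columns is squeezed so that only the
order of the entries is shown.

```shell
# Six lines in three columns, no header (-t): columns are filled downwards.
down=$(seq 1 6 | pr -t -3 | tr -s ' \t' ' ')
echo "$down"

# With -a the same lines are laid out across the columns instead.
across=$(seq 1 6 | pr -t -a -3 | tr -s ' \t' ' ')
echo "$across"

# A full default page is 66 lines: 5 header + 56 text + 5 footer.
page_lines=$(seq 1 56 | pr | wc -l)
echo "$page_lines"
```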
970 The following changes were made in version 1.22i and apply to later
971 versions of @command{pr}:
972 @c FIXME: this whole section here sounds very awkward to me. I
973 @c made a few small changes, but really it all needs to be redone. - Brian
974 @c OK, I fixed another sentence or two, but some of it I just don't understand.
Some lowercase-letter options (@samp{-s}, @samp{-w}) have been
redefined for better @sc{posix} compliance. The output in some other
cases has been adapted to match other Unix systems. These changes are not
compatible with earlier versions of the program.

Some new capital-letter options (@samp{-J}, @samp{-S}, @samp{-W})
have been introduced to turn off unexpected interference from
lowercase-letter options. The @samp{-N} option and the second argument
@var{last_page} of @samp{+FIRST_PAGE} offer more flexibility. The detailed
handling of form feeds set in the input files requires the @samp{-T} option.

Capital-letter options override lowercase ones.

The arguments of some options (compare @samp{-s}, @samp{-S}, @samp{-e},
@samp{-i}, @samp{-n}) cannot be given as arguments separate from the
preceding option letter (as already stated in the @sc{posix} specification).
1000 The program accepts the following options. Also see @ref{Common options}.
1004 @item +@var{first_page}[:@var{last_page}]
1005 @itemx --pages=@var{first_page}[:@var{last_page}]
1006 @opindex +@var{first_page}[:@var{last_page}]
Begin printing with page @var{first_page} and stop with @var{last_page}.
A missing @samp{:@var{last_page}} implies end of file. While counting
the pages to skip, each form feed in the input file starts a new page.
Page counting with and without @samp{+@var{first_page}}
is identical. By default, counting starts with the first page of the input
file (not the first page printed). Line numbering may be altered by @samp{-N}
1017 @itemx --columns=@var{column}
1018 @opindex -@var{column}
1020 @cindex down columns
With each single @var{file}, produce @var{column} columns of output
(default is 1) and print columns down, unless @samp{-a} is used. The
column width is automatically decreased as @var{column} increases, unless
you use the @samp{-W/-w} option to increase @var{page_width} as well.
This option might well cause some lines to be truncated. The number of
lines in the columns on each page is balanced. The options @samp{-e}
and @samp{-i} are on for multiple text-column output. Together with the
@samp{-J} option, column alignment and line truncation are turned off:
lines of full length are joined in a free field format, and the @samp{-S}
option may set field separators. @samp{-@var{column}} may not be used
with the @samp{-m} option.
1037 @cindex across columns
1038 With each single @var{file}, print columns across rather than down. The
1039 @samp{-@var{column}} option must be given with @var{column} greater than one.
1040 If a line is too long to fit in a column, it is truncated.
1043 @itemx --show-control-chars
1045 @opindex --show-control-chars
1046 Print control characters using hat notation (e.g., @samp{^G}); print
1047 other unprintable characters in octal backslash notation. By default,
1048 unprintable characters are not changed.
1051 @itemx --double-space
1053 @opindex --double-space
1054 @cindex double spacing
1055 Double space the output.
1057 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
1058 @itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
1060 @opindex --expand-tabs
1062 Expand @var{tab}s to spaces on input. Optional argument @var{in-tabchar} is
1063 the input tab character (default is the TAB character). Second optional
1064 argument @var{in-tabwidth} is the input tab character's width (default
1072 @opindex --form-feed
1073 Use a form feed instead of newlines to separate output pages. The default
1074 page length of 66 lines is not altered, but the number of lines of text
1075 per page changes from the default of 56 to 63.
1077 @item -h @var{HEADER}
1078 @itemx --header=@var{HEADER}
1081 Replace the filename in the header with the centered string @var{header}.
1082 Left-hand-side truncation (marked by a @samp{*}) may occur if the total
1083 header line @samp{yyyy-mm-dd HH:MM HEADER Page nnnn} becomes larger than
1084 @var{page_width}. @samp{-h ""} prints a blank line header. Don't use
1086 A space between the @samp{-h} option and the argument is always
1089 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
1090 @itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
1092 @opindex --output-tabs
1094 Replace spaces with @var{tab}s on output. Optional argument @var{out-tabchar}
1095 is the output tab character (default is the TAB character). Second optional
1096 argument @var{out-tabwidth} is the output tab character's width (default
1102 @opindex --join-lines
1103 Merge lines of full length. Used together with the column options
1104 @samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}. Turns off
1105 @samp{-W/-w} line truncation;
1106 no column alignment used; may be used with @samp{-S[@var{string}]}.
1107 @samp{-J} has been introduced (together with @samp{-W} and @samp{-S})
1108 to disentangle the old (@sc{posix}-compliant) options @samp{-w} and
1109 @samp{-s} along with the three column options.
1112 @item -l @var{page_length}
1113 @itemx --length=@var{page_length}
1116 Set the page length to @var{page_length} (default 66) lines, including
1117 the lines of the header [and the footer]. If @var{page_length} is less
1118 than or equal to 10 (or <= 3 with @samp{-F}), the header and footer are
1119 omitted, and all form feeds set in input files are eliminated, as if
1120 the @samp{-T} option had been given.
1126 Merge and print all @var{file}s in parallel, one in each column. If a
1127 line is too long to fit in a column, it is truncated, unless the @samp{-J}
1128 option is used. @samp{-S[@var{string}]} may be used. Empty pages in
1129 some @var{file}s (form feeds set) produce empty columns, still marked
1130 by @var{string}. The result is a continuous line numbering and column
1131 marking throughout the whole merged file. Completely empty merged pages
1132 show no separators or line numbers. The default header becomes
1133 @samp{yyyy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
1134 @samp{-h @var{header}} to fill up the middle blank part.
1136 @item -n[@var{number-separator}[@var{digits}]]
1137 @itemx --number-lines[=@var{number-separator}[@var{digits}]]
1139 @opindex --number-lines
1140 Provide @var{digits} digit line numbering (default for @var{digits} is
1141 5). With multicolumn output the number occupies the first @var{digits}
1142 column positions of each text column or only each line of @samp{-m}
1143 output. With single column output the number precedes each line just as
1144 @samp{-m} does. Default counting of the line numbers starts with the
1145 first line of the input file (not the first line printed, compare the
1146 @samp{--page} option and @samp{-N} option).
1147 Optional argument @var{number-separator} is the character appended to
1148 the line number to separate it from the text that follows. The default
1149 separator is the TAB character. In a strict sense a TAB is always
1150 printed with single column output only. The @var{TAB}-width varies
1151 with the @var{TAB}-position, e.g. with the left @var{margin} specified
1152 by @samp{-o} option. With multicolumn output priority is given to
1153 @samp{equal width of output columns} (a @sc{posix} specification).
1154 The @var{TAB}-width is fixed to the value of the first column and does
1155 not change with different values of left @var{margin}. That means a
1156 fixed number of spaces is always printed in the place of the
1157 @var{number-separator tab}. The tabification depends upon the output
1160 @item -N @var{line_number}
1161 @itemx --first-line-number=@var{line_number}
1163 @opindex --first-line-number
1164 Start line counting with the number @var{line_number} at first line of
1165 first page printed (in most cases not the first line of the input file).
1167 @item -o @var{margin}
1168 @itemx --indent=@var{margin}
1171 @cindex indenting lines
1173 Indent each line with a margin @var{margin} spaces wide (default is zero).
1174 The total page width is the size of the margin plus the @var{page_width}
1175 set with the @samp{-W/-w} option. A limited overflow may occur with
1176 numbered single column output (compare @samp{-n} option).
1179 @itemx --no-file-warnings
1181 @opindex --no-file-warnings
1182 Do not print a warning message when an argument @var{file} cannot be
1183 opened. (The exit status will still be nonzero, however.)
1185 @item -s[@var{char}]
1186 @itemx --separator[=@var{char}]
1188 @opindex --separator
1189 Separate columns by a single character @var{char}. The default for
1190 @var{char} is the TAB character without @samp{-w} and @samp{no
1191 character} with @samp{-w}. Without @samp{-s} the default separator
1192 @samp{space} is set. @samp{-s[char]} turns off line truncation of all
1193 three column options (@samp{-COLUMN}|@samp{-a -COLUMN}|@samp{-m}) unless
1194 @samp{-w} is set. This is a @sc{posix}-compliant formulation.
1197 @item -S[@var{string}]
1198 @itemx --sep-string[=@var{string}]
1200 @opindex --sep-string
1201 Use @var{string} to separate output columns. The @samp{-S} option doesn't
1202 affect the @samp{-W/-w} option, unlike the @samp{-s} option which does. It
1203 does not affect line truncation or column alignment.
1204 Without @samp{-S}, and with @samp{-J}, @code{pr} uses the default output
1206 Without @samp{-S} or @samp{-J}, @code{pr} uses a @samp{space}
1207 (same as @samp{-S" "}).
1208 Using @samp{-S} with no @var{string} is equivalent to @samp{-S""}.
1209 Note that for some of @code{pr}'s options the single-letter option
1210 character must be followed immediately by any corresponding argument;
1211 there may not be any intervening white space.
1212 @samp{-S/-s} is one of them. Don't use @samp{-S "STRING"}.
1213 @sc{posix} requires this.
1216 @itemx --omit-header
1218 @opindex --omit-header
1219 Do not print the usual header [and footer] on each page, and do not fill
1220 out the bottom of pages (with blank lines or a form feed). No page
1221 structure is produced, but form feeds set in the input files are retained.
1222 The predefined pagination is not changed. @samp{-t} or @samp{-T} may be
1223 useful together with other options; e.g.: @samp{-t -e4}, expand TAB characters
1224 in the input file to 4 spaces but don't make any other changes. Use of
1225 @samp{-t} overrides @samp{-h}.
1228 @itemx --omit-pagination
1230 @opindex --omit-pagination
1231 Do not print header [and footer]. In addition eliminate all form feeds
1232 set in the input files.
1235 @itemx --show-nonprinting
1237 @opindex --show-nonprinting
1238 Print unprintable characters in octal backslash notation.
1240 @item -w @var{page_width}
1241 @itemx --width=@var{page_width}
1244 Set page width to @var{page_width} characters for multiple text-column
1245 output only (default for @var{page_width} is 72). @samp{-s[CHAR]} turns
1246 off the default page width and any line truncation and column alignment.
1247 Lines of full length are merged, regardless of the column options
1248 set. No @var{page_width} setting is possible with single column output.
1249 A @sc{posix}-compliant formulation.
1251 @item -W @var{page_width}
1252 @itemx --page_width=@var{page_width}
1254 @opindex --page_width
1255 Set the page width to @var{page_width} characters. That's valid with and
1256 without a column option. Text lines are truncated, unless @samp{-J}
1257 is used. Together with one of the three column options
1258 (@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}) column
1259 alignment is always used. The separator options @samp{-S} or @samp{-s}
1260 don't affect the @samp{-W} option. Default is 72 characters. Without
1261 @samp{-W @var{page_width}} and without any of the column options, no line
1262 truncation is used (defined to keep backward compatibility and to meet
1263 the most frequent tasks). That's equivalent to @samp{-W 72 -J}. With and
1264 without @samp{-W @var{page_width}} the header line is always truncated
1265 to avoid line overflow.
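As an illustration of the column options, the following sketch prints six
input lines in three columns, first down (the default) and then across with
@samp{-a}. It uses @samp{-t} so that no dated header appears in the output.
The use of @code{seq} to generate input and of @code{tr} to squeeze the
column padding are conveniences assumed for this example only.

```shell
# Six numbered lines in three columns, filled down (the default).
down=$(seq 1 6 | pr -3 -t | tr -s ' \t' ' ')
# The same input with -a fills the columns across instead.
across=$(seq 1 6 | pr -3 -a -t | tr -s ' \t' ' ')
printf '%s\n' "$down" "$across"
```

With down-filling the first output line holds 1, 3 and 5, because the six
lines are balanced two per column; with @samp{-a} it holds 1, 2 and 3.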
1270 @node fold invocation
1271 @section @code{fold}: Wrap input lines to fit in specified width
1274 @cindex wrapping long input lines
1275 @cindex folding long input lines
1277 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1278 standard input if none are given, to standard output, breaking long
1282 fold [@var{option}]@dots{} [@var{file}]@dots{}
1285 By default, @code{fold} breaks lines wider than 80 columns. The output
1286 is split into as many lines as necessary.
1288 @cindex screen columns
1289 @code{fold} counts screen columns by default; thus, a tab may count more
1290 than one column, backspace decreases the column count, and carriage
1291 return sets the column to zero.
1293 The program accepts the following options. Also see @ref{Common options}.
1301 Count bytes rather than columns, so that tabs, backspaces, and carriage
1302 returns are each counted as taking up one column, just like other
1309 Break at word boundaries: the line is broken after the last blank before
1310 the maximum line length. If the line contains no such blanks, the line
1311 is broken at the maximum line length as usual.
1313 @item -w @var{width}
1314 @itemx --width=@var{width}
1317 Use a maximum line length of @var{width} columns instead of 80.
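A short sketch of the two most common uses may help: a hard break at a fixed
width, and a break at word boundaries with @samp{-s}. The input strings are
made up for this example.

```shell
# Break a 10-character line into 5-column pieces.
pieces=$(printf 'abcdefghij\n' | fold -w 5)
# With -s, break after the last blank that fits within the width.
words=$(printf 'the quick brown fox\n' | fold -s -w 10)
printf '%s\n' "$pieces" "$words"
```

Note that with @samp{-s} the blank at which the line is broken stays at the
end of the first output line.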
1322 @node Output of parts of files
1323 @chapter Output of parts of files
1325 @cindex output of parts of files
1326 @cindex parts of files, output of
1328 These commands output pieces of the input.
1331 * head invocation:: Output the first part of files.
1332 * tail invocation:: Output the last part of files.
1333 * split invocation:: Split a file into fixed-size pieces.
1334 * csplit invocation:: Split a file into context-determined pieces.
1337 @node head invocation
1338 @section @code{head}: Output the first part of files
1341 @cindex initial part of files, outputting
1342 @cindex first part of files, outputting
1344 @code{head} prints the first part (10 lines by default) of each
1345 @var{file}; it reads from standard input if no files are given or
1346 when given a @var{file} of @samp{-}. Synopses:
1349 head [@var{option}]@dots{} [@var{file}]@dots{}
1350 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1353 If more than one @var{file} is specified, @code{head} prints a
1354 one-line header consisting of
1356 ==> @var{file name} <==
1359 before the output for each @var{file}.
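The header behavior can be seen with two small files. This is only a sketch;
the file names @file{a.txt} and @file{b.txt} and the temporary directory are
assumptions of the example.

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'a1\na2\n' > a.txt
printf 'b1\nb2\n' > b.txt
# Two file arguments: each file's output is preceded by a header line.
with_headers=$(head a.txt b.txt)
printf '%s\n' "$with_headers"
```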
1361 @code{head} accepts two option formats: the new one, in which numbers
1362 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1363 the number precedes any option letters (@samp{-1q}).
1365 The program accepts the following options. Also see @ref{Common options}.
1369 @item -@var{count}@var{options}
1370 @opindex -@var{count}
1371 This option is only recognized if it is specified first. @var{count} is
1372 a decimal number optionally followed by a size letter (@samp{b},
1373 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1374 or other option letters (@samp{cqv}).
1376 @item -c @var{bytes}
1377 @itemx --bytes=@var{bytes}
1380 Print the first @var{bytes} bytes, instead of initial lines. Appending
1381 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1385 @itemx --lines=@var{n}
1388 Output the first @var{n} lines.
1396 Never print file name headers.
1402 Always print file name headers.
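The options above might be combined as in the following sketch. The file
@file{nums.txt} is hypothetical, and reading from @file{/dev/zero} assumes a
system that provides that device.

```shell
dir=$(mktemp -d)
seq 1 20 > "$dir/nums.txt"
# First three lines of the file.
first3=$(head -n 3 "$dir/nums.txt")
# First 1024 bytes of a 4096-byte stream; the 'k' suffix multiplies by 1024.
nbytes=$(head -c 4096 /dev/zero | head -c 1k | wc -c)
# -q suppresses the file name headers even with several file arguments.
quiet=$(head -q -n 1 "$dir/nums.txt" "$dir/nums.txt")
printf '%s\n' "$first3" "$nbytes" "$quiet"
```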
1407 @node tail invocation
1408 @section @code{tail}: Output the last part of files
1411 @cindex last part of files, outputting
1413 @code{tail} prints the last part (10 lines by default) of each
1414 @var{file}; it reads from standard input if no files are given or
1415 when given a @var{file} of @samp{-}. Synopses:
1418 tail [@var{option}]@dots{} [@var{file}]@dots{}
1419 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1420 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1423 If more than one @var{file} is specified, @code{tail} prints a
1424 one-line header consisting of
1426 ==> @var{file name} <==
1429 before the output for each @var{file}.
1431 @cindex BSD @code{tail}
1432 @sc{gnu} @code{tail} can output any amount of data (some other versions of
1433 @code{tail} cannot). It also has no @samp{-r} option (print in
1434 reverse), since reversing a file is really a different job from printing
1435 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1436 only reverse files that are at most as large as its buffer, which is
1437 typically 32k. A more reliable and versatile way to reverse files is
1438 the @sc{gnu} @code{tac} command.
1440 @code{tail} accepts two option formats: the new one, in which numbers
1441 are arguments to the options (@samp{-n 1}), and the old one, in which
1442 the number precedes any option letters (@samp{-1} or @samp{+1}).
1444 If any option-argument is a number @var{n} starting with a @samp{+},
1445 @code{tail} begins printing with the @var{n}th item from the start of
1446 each file, instead of from the end.
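For instance, the difference between counting from the start and counting
from the end can be sketched as follows, using @code{seq} to supply five
numbered lines (an assumption of the example, not part of @code{tail}).

```shell
# Start printing at the third line from the top: lines 3 through 5.
from_third=$(seq 1 5 | tail -n +3)
# Without the plus sign, count from the end instead: the last two lines.
last_two=$(seq 1 5 | tail -n 2)
printf '%s\n' "$from_third" "$last_two"
```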
1448 The program accepts the following options. Also see @ref{Common options}.
1454 @opindex -@var{count}
1455 @opindex +@var{count}
1456 This option is only recognized if it is specified first. @var{count} is
1457 a decimal number optionally followed by a size letter (@samp{b},
1458 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1459 or other option letters (@samp{cfqv}).
1461 @item -c @var{bytes}
1462 @itemx --bytes=@var{bytes}
1465 Output the last @var{bytes} bytes, instead of final lines. Appending
1466 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1470 @itemx --follow[=@var{how}]
1473 @cindex growing files
1474 @vindex name @r{follow option}
1475 @vindex descriptor @r{follow option}
1476 Loop forever trying to read more characters at the end of the file,
1477 presumably because the file is growing. This option is ignored when
1478 reading from a pipe.
1479 If more than one file is given, @code{tail} prints a header whenever it
1480 gets output from a different file, to indicate which file that output is
1483 There are two ways to specify how you'd like to track files with this option,
1484 but that difference is noticeable only when a followed file is removed or
1486 If you'd like to continue to track the end of a growing file even after
1487 it has been unlinked, use @samp{--follow=descriptor}. This is the default
1488 behavior, but it is not useful if you're tracking a log file that may be
1489 rotated (removed or renamed, then reopened). In that case, use
1490 @samp{--follow=name} to track the named file by reopening it periodically
1491 to see if it has been removed and recreated by some other program.
1493 No matter which method you use, if the tracked file is determined to have
1494 shrunk, @code{tail} prints a message saying the file has been truncated
1495 and resumes tracking the end of the file from the newly-determined endpoint.
1497 When a file is removed, @code{tail}'s behavior depends on whether it is
1498 following the name or the descriptor. When following by name, @code{tail}
1499 can detect that a file has been removed and gives a message to that effect,
1500 and if @samp{--retry} has been specified it will continue checking
1501 periodically to see if the file reappears.
1502 When following a descriptor, @code{tail} does not detect that the file has
1503 been unlinked or renamed and issues no message; even though the file
1504 may no longer be accessible via its original name, it may still be
1507 The option values @samp{descriptor} and @samp{name} may be specified only
1508 with the long form of the option, not with @samp{-f}.
1512 This option is meaningful only when following by name.
1513 Without this option, when @code{tail} encounters a file that doesn't
1514 exist or is otherwise inaccessible, it reports that fact and
1515 never checks it again.
1517 @itemx --sleep-interval=@var{n}
1518 @opindex --sleep-interval
1519 Change the number of seconds to wait between iterations (the default is 1).
1520 During one iteration, every specified file is checked to see if it has
1523 @itemx --pid=@var{pid}
1525 When following by name or by descriptor, you may specify the process ID,
1526 @var{pid}, of the sole writer of all @var{file} arguments. Then, shortly
1527 after that process terminates, tail will also terminate. This will
1528 work properly only if the writer and the tailing process are running on
1529 the same machine. For example, to save the output of a build in a file
1530 and to watch the file grow, if you invoke @code{make} and @code{tail}
1531 like this then the tail process will stop when your build completes.
1532 Without this option, you would have had to kill the @code{tail -f}
1535 $ make >& makerr & tail --pid=$! -f makerr
1537 If you specify a @var{pid} that is not in use or that does not correspond
1538 to the process that is writing to the tailed files, then @code{tail}
1539 may terminate long before any @var{file}s stop growing or it may not
1540 terminate until long after the real writer has terminated.
1541 Note that @samp{--pid} cannot be supported on some systems; @code{tail}
1542 will print a warning if this is the case.
1544 @itemx --max-unchanged-stats=@var{n}
1545 @opindex --max-unchanged-stats
1546 When tailing a file by name, if there have been @var{n} (default
1547 N=@value{DEFAULT_MAX_N_UNCHANGED_STATS_BETWEEN_OPENS}) consecutive
1548 iterations for which the size has remained the same, then
1549 @code{open}/@code{fstat} the file to determine if that file name is
1550 still associated with the same device/inode-number pair as before.
1551 When following a log file that is rotated, this is approximately the
1552 number of seconds between when tail prints the last pre-rotation lines
1553 and when it prints the lines that have accumulated in the new log file.
1554 This option is meaningful only when following by name.
1557 @itemx --lines=@var{n}
1560 Output the last @var{n} lines.
1568 Never print file name headers.
1574 Always print file name headers.
1579 @node split invocation
1580 @section @code{split}: Split a file into fixed-size pieces
1583 @cindex splitting a file into pieces
1584 @cindex pieces, splitting a file into
1586 @code{split} creates output files containing consecutive sections of
1587 @var{input} (standard input if none is given or @var{input} is
1588 @samp{-}). Synopsis:
1591 split [@var{option}] [@var{input} [@var{prefix}]]
1594 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1595 left over for the last section), into each output file.
1597 @cindex output file name prefix
1598 The output files' names consist of @var{prefix} (@samp{x} by default)
1599 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1600 that concatenating the output files in sorted order by file name produces
1601 the original input file. (If more than 676 output files are required,
1602 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
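The default naming scheme and the round trip back to the original input can
be sketched like this. The 2500-line input file and the name
@file{rejoined.txt} are assumptions chosen for the example.

```shell
dir=$(mktemp -d)
cd "$dir"
seq 1 2500 > input.txt
# Default: 1000 lines per piece, written to xaa, xab, xac.
split input.txt
names=$(ls x??)
printf '%s\n' "$names"
# Concatenating the pieces in sorted name order restores the input.
cat x?? > rejoined.txt
cmp input.txt rejoined.txt && echo identical
```

The last piece, @file{xac}, holds the 500 lines left over after two full
pieces.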
1604 The program accepts the following options. Also see @ref{Common options}.
1609 @itemx -l @var{lines}
1610 @itemx --lines=@var{lines}
1613 Put @var{lines} lines of @var{input} into each output file.
1615 @item -b @var{bytes}
1616 @itemx --bytes=@var{bytes}
1619 Put @var{bytes} bytes of @var{input} into each output file.
1620 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1621 @samp{m} by 1048576.
1623 @item -C @var{bytes}
1624 @itemx --line-bytes=@var{bytes}
1626 @opindex --line-bytes
1627 Put into each output file as many complete lines of @var{input} as
1628 possible without exceeding @var{bytes} bytes. For lines longer than
1629 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1630 less than @var{bytes} bytes of the line are left, then continue
1631 normally. @var{bytes} has the same format as for the @samp{--bytes}
1636 Write a diagnostic to standard error just before each output file is opened.
1641 @node csplit invocation
1642 @section @code{csplit}: Split a file into context-determined pieces
1645 @cindex context splitting
1646 @cindex splitting a file into pieces by context
1648 @code{csplit} creates zero or more output files containing sections of
1649 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1652 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1655 The contents of the output files are determined by the @var{pattern}
1656 arguments, as detailed below. An error occurs if a @var{pattern}
1657 argument refers to a nonexistent line of the input file (e.g., if no
1658 remaining line matches a given regular expression). After every
1659 @var{pattern} has been matched, any remaining input is copied into one
1662 By default, @code{csplit} prints the number of bytes written to each
1663 output file after it has been created.
1665 The types of pattern arguments are:
1670 Create an output file containing the input up to but not including line
1671 @var{n} (a positive integer). If followed by a repeat count, also
1672 create an output file containing the next @var{n} lines of the input
1673 file once for each repeat.
1675 @item /@var{regexp}/[@var{offset}]
1676 Create an output file containing the current line up to (but not
1677 including) the next line of the input file that contains a match for
1678 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1679 followed by a positive integer. If it is given, the input up to the
1680 matching line plus or minus @var{offset} is put into the output file,
1681 and the line after that begins the next section of input.
1683 @item %@var{regexp}%[@var{offset}]
1684 Like the previous type, except that it does not create an output
1685 file, so that section of the input file is effectively ignored.
1687 @item @{@var{repeat-count}@}
1688 Repeat the previous pattern @var{repeat-count} additional
1689 times. @var{repeat-count} can either be a positive integer or an
1690 asterisk, meaning repeat as many times as necessary until the input is
1695 The output files' names consist of a prefix (@samp{xx} by default)
1696 followed by a suffix. By default, the suffix is an ascending sequence
1697 of two-digit decimal numbers from @samp{00} to @samp{99}. In any case,
1698 concatenating the output files in sorted order by filename produces the
1699 original input file.
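A small sketch may make the pattern types and the default file names
concrete. It splits a five-line file with one line-number pattern and one
regular-expression pattern; the input text and the marker word @samp{MARK}
are invented for the example.

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'one\ntwo\nthree\nMARK\nfive\n' > input.txt
# Pattern 1: split before line 3.  Pattern 2: split before the next
# line matching ^MARK.  The remaining input goes into a final file.
sizes=$(csplit input.txt 3 '/^MARK/')
printf '%s\n' "$sizes"    # byte count of each piece, one per line
# Concatenating the pieces in name order gives back the input.
cat xx00 xx01 xx02 | cmp - input.txt && echo identical
```

Here @file{xx00} receives lines 1 and 2, @file{xx01} receives only line 3,
and @file{xx02} receives the matching line and everything after it.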
1701 By default, if @code{csplit} encounters an error or receives a hangup,
1702 interrupt, quit, or terminate signal, it removes any output files
1703 that it has created so far before it exits.
1705 The program accepts the following options. Also see @ref{Common options}.
1709 @item -f @var{prefix}
1710 @itemx --prefix=@var{prefix}
1713 @cindex output file name prefix
1714 Use @var{prefix} as the output file name prefix.
1716 @item -b @var{suffix}
1717 @itemx --suffix=@var{suffix}
1720 @cindex output file name suffix
1721 Use @var{suffix} as the output file name suffix. When this option is
1722 specified, the suffix string must include exactly one
1723 @code{printf(3)}-style conversion specification, possibly including
1724 format specification flags, a field width, a precision specification,
1725 or all of these kinds of modifiers. The format letter must convert a
1726 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1727 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1728 entire @var{suffix} is given (with the current output file number) to
1729 @code{sprintf(3)} to form the file name suffixes for each of the
1730 individual output files in turn. If this option is used, the
1731 @samp{--digits} option is ignored.
1733 @item -n @var{digits}
1734 @itemx --digits=@var{digits}
1737 Use output file names containing numbers that are @var{digits} digits
1738 long instead of the default 2.
1743 @opindex --keep-files
1744 Do not remove output files when errors are encountered.
1747 @itemx --elide-empty-files
1749 @opindex --elide-empty-files
1750 Suppress the generation of zero-length output files. (In cases where
1751 the section delimiters of the input file are supposed to mark the first
1752 lines of each of the sections, the first output file will generally be a
1753 zero-length file unless you use this option.) The output file sequence
1754 numbers always run consecutively starting from 0, even when this option
1765 Do not print counts of output file sizes.
1770 @node Summarizing files
1771 @chapter Summarizing files
1773 @cindex summarizing files
1775 These commands generate just a few numbers representing entire
1779 * wc invocation:: Print byte, word, and line counts.
1780 * sum invocation:: Print checksum and block counts.
1781 * cksum invocation:: Print CRC checksum and byte counts.
1782 * md5sum invocation:: Print or check message-digests.
1787 @section @code{wc}: Print byte, word, and line counts
1791 @cindex character count
1795 @code{wc} counts the number of bytes, characters, whitespace-separated
1796 words, and newlines in each given @var{file}, or standard input if none
1797 are given or for a @var{file} of @samp{-}. Synopsis:
1800 wc [@var{option}]@dots{} [@var{file}]@dots{}
1803 @cindex total counts
1804 @vindex POSIXLY_CORRECT
1805 @code{wc} prints one line of counts for each file, and if the file was
1806 given as an argument, it prints the file name following the counts. If
1807 more than one @var{file} is given, @code{wc} prints a final line
1808 containing the cumulative counts, with the file name @file{total}. The
1809 counts are printed in this order: newlines, words, characters, bytes.
1810 By default, each count is output right-justified in a 7-byte field with
1811 one space between fields so that the numbers and file names line up nicely
1812 in columns. However, @sc{posix} requires that there be exactly one space
1813 separating columns. You can make @code{wc} use the @sc{posix}-mandated
1814 output format by setting the @env{POSIXLY_CORRECT} environment variable.
1816 By default, @code{wc} prints three counts: the newline, word, and byte
1817 counts. Options can specify that only certain counts be printed.
1818 Options do not undo others previously given, so
1825 prints both the byte counts and the word counts.
1827 With the @code{--max-line-length} option, @code{wc} prints the length
1828 of the longest line per file, and if there is more than one file it
1829 prints the maximum (not the sum) of those lengths.
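The individual counts can be requested one at a time, as in this sketch. It
reads the hypothetical file @file{sample.txt} through standard input so that
no file name is printed after the count.

```shell
dir=$(mktemp -d)
printf 'alpha beta\ngamma\n' > "$dir/sample.txt"
lines=$(wc -l < "$dir/sample.txt")     # newline count
words=$(wc -w < "$dir/sample.txt")     # word count
bytes=$(wc -c < "$dir/sample.txt")     # byte count
longest=$(wc -L < "$dir/sample.txt")   # length of the longest line
echo "$lines $words $bytes $longest"
```

For this input the four values are 2, 3, 17 and 10.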
1831 The program accepts the following options. Also see @ref{Common options}.
1839 Print only the byte counts.
1845 Print only the character counts.
1851 Print only the word counts.
1857 Print only the newline counts.
1860 @itemx --max-line-length
1862 @opindex --max-line-length
1863 Print only the maximum line lengths.
1868 @node sum invocation
1869 @section @code{sum}: Print checksum and block counts
1872 @cindex 16-bit checksum
1873 @cindex checksum, 16-bit
1875 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1876 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1879 sum [@var{option}]@dots{} [@var{file}]@dots{}
1882 @code{sum} prints the checksum for each @var{file} followed by the
1883 number of blocks in the file (rounded up). If more than one @var{file}
1884 is given, file names are also printed (by default). (With the
1885 @samp{--sysv} option, corresponding file names are printed when there is
1886 at least one file argument.)
1888 By default, @sc{gnu} @code{sum} computes checksums using an algorithm
1889 compatible with BSD @code{sum} and prints file sizes in units of
1892 The program accepts the following options. Also see @ref{Common options}.
1898 @cindex BSD @code{sum}
1899 Use the default (BSD compatible) algorithm. This option is included for
1900 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1901 given, it has no effect.
1907 @cindex System V @code{sum}
1908 Compute checksums using an algorithm compatible with System V
1909 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1913 @code{sum} is provided for compatibility; the @code{cksum} program (see
1914 next section) is preferable in new applications.
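The two algorithms can be compared on the same input, as in the sketch
below. The four-byte input is made up for the example, and @code{awk} is
used only to pick out the checksum field.

```shell
# Default (BSD-compatible) algorithm: checksum, then block count.
printf 'abc\n' | sum
# System V algorithm: checksum, then count of 512-byte blocks.
printf 'abc\n' | sum --sysv
# For an input this short, the System V checksum is simply the sum
# of the byte values: 97 + 98 + 99 + 10.
sysv=$(printf 'abc\n' | sum --sysv | awk '{print $1}')
echo "$sysv"
```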
1917 @node cksum invocation
1918 @section @code{cksum}: Print CRC checksum and byte counts
1921 @cindex cyclic redundancy check
1922 @cindex CRC checksum
1924 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1925 given @var{file}, or standard input if none are given or for a
1926 @var{file} of @samp{-}. Synopsis:
1929 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1932 @code{cksum} prints the CRC checksum for each file along with the number
1933 of bytes in the file, and the filename unless no arguments were given.
1935 @code{cksum} is typically used to ensure that files
1936 transferred by unreliable means (e.g., netnews) have not been corrupted,
1937 by comparing the @code{cksum} output for the received files with the
1938 @code{cksum} output for the original files (typically given in the
1941 The CRC algorithm is specified by the @sc{posix.2} standard. It is not
1942 compatible with the BSD or System V @code{sum} algorithms (see the
1943 previous section); it is more robust.
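A transfer check along these lines might look like the following sketch.
The file names @file{original} and @file{received} are hypothetical, and
@code{cp} stands in for whatever unreliable transport is being checked.

```shell
dir=$(mktemp -d)
printf 'payload\n' > "$dir/original"
cp "$dir/original" "$dir/received"
# Read through standard input so that only the CRC and the byte
# count are printed, without a file name.
sent=$(cksum < "$dir/original")
got=$(cksum < "$dir/received")
[ "$sent" = "$got" ] && echo "transfer intact"
# Empty input gives a fixed, easily recognized result.
empty=$(cksum < /dev/null)
echo "$empty"
```

The empty input yields a CRC of 4294967295 and a byte count of 0.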
1945 The only options are @samp{--help} and @samp{--version}. @xref{Common
1949 @node md5sum invocation
1950 @section @code{md5sum}: Print or check message-digests
1953 @cindex 128-bit checksum
1954 @cindex checksum, 128-bit
1955 @cindex fingerprint, 128-bit
1956 @cindex message-digest, 128-bit
1958 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1959 @dfn{message-digest}) for each specified @var{file}.
1960 If a @var{file} is specified as @samp{-} or if no files are given
1961 @code{md5sum} computes the checksum for the standard input.
1962 @code{md5sum} can also determine whether a file and checksum are
1963 consistent. Synopses:
1966 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1967 md5sum [@var{option}]@dots{} --check [@var{file}]
1970 For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
1971 indicating a binary or text input file, and the filename.
1972 If @var{file} is omitted or specified as @samp{-}, standard input is read.
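The generate-then-verify cycle can be sketched as follows, using the
@samp{--check} and @samp{--status} options described below. The file names
@file{data.txt} and @file{sums.md5} are assumptions of the example.

```shell
dir=$(mktemp -d)
cd "$dir"
printf 'hello\n' > data.txt
# Record the checksum line for the file.
md5sum data.txt > sums.md5
# Verify silently: the exit status is 0 while the file is unchanged.
md5sum --check --status sums.md5; before=$?
# Modify the file and verify again: the exit status becomes nonzero.
printf 'tampered\n' >> data.txt
md5sum --check --status sums.md5; after=$?
echo "before=$before after=$after"
```

Dropping @samp{--status} prints one line per file saying whether it passed.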
1974 The program accepts the following options. Also see @ref{Common options}.
1982 @cindex binary input files
1983 Treat all input files as binary. This option has no effect on Unix
1984 systems, since they don't distinguish between binary and text files.
1985 This option is useful on systems that have different internal and
1986 external character representations. On MS-DOS and MS-Windows, this is
1991 Read filenames and checksum information from the single @var{file}
1992 (or from stdin if no @var{file} was specified) and report whether
1993 each named file and the corresponding checksum data are consistent.
1994 The input to this mode of @code{md5sum} is usually the output of
1995 a prior, checksum-generating run of @samp{md5sum}.
1996 Each valid line of input consists of an MD5 checksum, a binary/text
1997 flag, and then a filename.
1998 Binary files are marked with @samp{*}, text with @samp{ }.
1999 For each such line, @code{md5sum} reads the named file and computes its
2000 MD5 checksum. Then, if the computed message digest does not match the
2001 one on the line with the filename, the file is noted as having
2002 failed the test. Otherwise, the file passes the test.
2003 By default, for each valid line, one line is written to standard
2004 output indicating whether the named file passed the test.
2005 After all checks have been performed, if there were any failures,
2006 a warning is issued to standard error.
2007 Use the @samp{--status} option to inhibit that output.
2008 If any listed file cannot be opened or read, if any valid line has
2009 an MD5 checksum inconsistent with the associated file, or if no valid
2010 line is found, @code{md5sum} exits with nonzero status. Otherwise,
2011 it exits successfully.
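The check mode described above can be sketched as a short round trip.
This is a minimal illustration, not a recommended procedure: the scratch
directory and the names @file{data.txt} and @file{sums.md5} are
hypothetical, and the exact report wording is assumed to be that of a
current @sc{gnu} @code{md5sum}.

```shell
# Scratch directory and file names below are hypothetical.
export LC_ALL=C
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/data.txt"
(cd "$workdir" && md5sum data.txt > sums.md5)    # checksum-generating run

# Unmodified file: one report line on standard output, zero exit status.
if ok_report=$(cd "$workdir" && md5sum --check sums.md5); then
  ok_status=0
else
  ok_status=$?
fi

# Modify the file; the digest no longer matches, so the status is nonzero.
printf 'changed\n' > "$workdir/data.txt"
if bad_report=$(cd "$workdir" && md5sum --check sums.md5 2>/dev/null); then
  bad_status=0
else
  bad_status=$?
fi

# With --status nothing is printed; only the exit status carries the result.
quiet_report=$(cd "$workdir" && md5sum --check --status sums.md5 2>&1) || true

printf '%s (exit %s)\n' "$ok_report" "$ok_status"
printf '%s (exit %s)\n' "$bad_report" "$bad_status"
rm -rf "$workdir"
```

Capturing the status inside an @code{if} keeps the sketch usable in
scripts that run with @samp{set -e}.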
2015 @cindex verifying MD5 checksums
2016 This option is useful only when verifying checksums.
2017 When verifying checksums, don't generate the default one-line-per-file
2018 diagnostic and don't output the warning summarizing any failures.
2019 Failures to open or read a file still evoke individual diagnostics to
2021 If all listed files are readable and are consistent with the associated
2022 MD5 checksums, exit successfully. Otherwise exit with a status code
2023 indicating there was a failure.
2029 @cindex text input files
2030 Treat all input files as text files. This is the reverse of
2037 @cindex verifying MD5 checksums
2038 When verifying checksums, warn about improperly formatted MD5 checksum lines.
2039 This option is useful only if all but a few lines in the checked input
2045 @node Operating on sorted files
2046 @chapter Operating on sorted files
2048 @cindex operating on sorted files
2049 @cindex sorted files, operations on
2051 These commands work with (or produce) sorted files.
2054 * sort invocation:: Sort text files.
2055 * uniq invocation:: Uniquify files.
2056 * comm invocation:: Compare two sorted files line by line.
2057 * ptx invocation:: Produce a permuted index of file contents.
2058 * tsort invocation:: Topological sort.
2062 @node sort invocation
2063 @section @code{sort}: Sort text files
2066 @cindex sorting files
2068 @code{sort} sorts, merges, or compares all the lines from the given
2069 files, or standard input if none are given or for a @var{file} of
2070 @samp{-}. By default, @code{sort} writes the results to standard
2074 sort [@var{option}]@dots{} [@var{file}]@dots{}
2077 @code{sort} has three modes of operation: sort (the default), merge,
2078 and check for sortedness. The following options change the operation
2085 @cindex checking for sortedness
2086 Check whether the given files are already sorted: if they are not all
2087 sorted, print an error message and exit with a status of 1.
2088 Otherwise, exit successfully.
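As a small sketch of check mode, the following feeds one sorted and one
unsorted input to @samp{sort -c} and records the exit status of each.
The sample data and variable names are made up for illustration, and the
C locale is assumed so that the collating order is plain byte order.

```shell
export LC_ALL=C

# Already sorted input: sort -c prints nothing and exits successfully.
if printf 'a\nb\nc\n' | sort -c; then
  sorted_status=0
else
  sorted_status=$?
fi

# Unsorted input: sort -c complains on standard error and exits with 1.
if printf 'b\na\n' | sort -c 2>/dev/null; then
  unsorted_status=0
else
  unsorted_status=$?
fi

echo "sorted input:   exit $sorted_status"
echo "unsorted input: exit $unsorted_status"
```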
2092 @cindex merging sorted files
2093 Merge the given files by sorting them as a group. Each input file must
2094 always be individually sorted. It always works to sort instead of
2095 merge; merging is provided because it is faster, in the case where it
2101 A pair of lines is compared as follows: if any key fields have been
2102 specified, @code{sort} compares each pair of fields, in the order
2103 specified on the command line, according to the associated ordering
2104 options, until a difference is found or no fields are left.
2105 Unless otherwise specified, all comparisons use the character
2106 collating sequence specified by the @env{LC_COLLATE} locale.
2108 If any of the global options @samp{Mbdfinr} are given but no key fields
2109 are specified, @code{sort} compares the entire lines according to the
2112 Finally, as a last resort when all keys compare equal (or if no
2113 ordering options were specified at all), @code{sort} compares the entire
2114 lines. The last resort comparison
2115 honors the @samp{-r} global option. The @samp{-s} (stable) option
2116 disables this last-resort comparison so that lines in which all fields
2117 compare equal are left in their original relative order. If no fields
2118 or global options are specified, @samp{-s} has no effect.
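The effect of the last-resort comparison is easiest to see on lines whose
keys tie. In this sketch (hypothetical two-line input, C locale assumed)
both lines have the same second field, so the default run falls back to
comparing whole lines while the @samp{-s} run keeps the input order.

```shell
export LC_ALL=C
input=$(printf 'b 1\na 1\n')

# Keys tie, so whole lines are compared as a last resort: "a 1" comes first.
default_order=$(printf '%s\n' "$input" | sort -k 2,2)

# -s disables the last resort: tied lines keep their original order.
stable_order=$(printf '%s\n' "$input" | sort -s -k 2,2)

printf 'default:\n%s\n' "$default_order"
printf 'stable:\n%s\n' "$stable_order"
```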
2120 @sc{gnu} @code{sort} (as specified for all @sc{gnu} utilities) has no limits on
2121 input line length or restrictions on bytes allowed within lines. In
2122 addition, if the final byte of an input file is not a newline, @sc{gnu}
2123 @code{sort} silently supplies one. A line's trailing newline is not
2124 part of the line for comparison purposes.@footnote{@sc{posix}.2-1992
2125 requires that the trailing newline be part of the comparison, and some
2126 @code{sort} implementations obey this requirement, but it is widely
2127 considered to be a bug in the standard and the next version of
2128 @sc{posix}.2 will likely remove this requirement.}
2130 Upon any error, @code{sort} exits with a status of @samp{2}.
2133 If the environment variable @env{TMPDIR} is set, @code{sort} uses its
2134 value as the directory for temporary files instead of @file{/tmp}. The
2135 @samp{-T @var{tempdir}} option in turn overrides the environment
2138 The following options affect the ordering of output lines. They may be
2139 specified globally or as part of a specific key field. If no key
2140 fields are specified, global options apply to comparison of entire
2141 lines; otherwise the global options are inherited by key fields that do
2142 not specify any special options of their own. In pre-@sc{posix}
2143 versions of @command{sort}, global options affect only later key fields,
2144 so portable shell scripts should specify global options first.
2150 @cindex blanks, ignoring leading
2152 Ignore leading blanks when finding sort keys in each line.
2153 The @env{LC_CTYPE} locale determines character types.
2157 @cindex phone directory order
2158 @cindex telephone directory order
2160 Sort in @dfn{phone directory} order: ignore all characters except
2161 letters, digits and blanks when sorting.
2162 The @env{LC_CTYPE} locale determines character types.
2166 @cindex case folding
2168 Fold lowercase characters into the equivalent uppercase characters when
2169 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
2170 The @env{LC_CTYPE} locale determines character types.
2174 @cindex general numeric sort
2176 Sort numerically, using the standard C function @code{strtod} to convert
2177 a prefix of each line to a double-precision floating point number.
2178 This allows floating point numbers to be specified in scientific notation,
2179 like @code{1.0e-34} and @code{10e100}.
2180 The @env{LC_NUMERIC} locale determines the decimal-point character.
2181 Do not report overflow, underflow, or conversion errors.
2182 Use the following collating sequence:
2186 Lines that do not start with numbers (all considered to be equal).
2188 NaNs (``Not a Number'' values, in IEEE floating point arithmetic)
2189 in a consistent but machine-dependent order.
2193 Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal).
2198 Use this option only if there is no alternative; it is much slower than
2199 @samp{-n} and it can lose information when converting to floating point.
2203 @cindex unprintable characters, ignoring
2205 Ignore unprintable characters.
2206 The @env{LC_CTYPE} locale determines character types.
2210 @cindex months, sorting by
2212 An initial string, consisting of any amount of whitespace, followed
2213 by a month name abbreviation, is folded to UPPER case and
2214 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
2215 Invalid names compare low to valid names. The @env{LC_TIME} locale
2216 determines the month spellings.
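A brief sketch of month ordering, assuming the C locale (English month
abbreviations) and made-up input. The line @samp{foo} is not a month
name, so it compares low and is expected to come out first.

```shell
export LC_ALL=C

# Month names are matched regardless of case; "foo" is not a month.
by_month=$(printf 'Mar\nfoo\njan\nFEB\n' | sort -M)

printf '%s\n' "$by_month"
```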
2220 @cindex numeric sort
2222 Sort numerically: the number begins each line; specifically, it consists
2223 of optional whitespace, an optional @samp{-} sign, and zero or more
2224 digits possibly separated by thousands separators, optionally followed
2225 by a decimal-point character and zero or more digits. The @env{LC_NUMERIC}
2226 locale specifies the decimal-point character and thousands separator.
2228 @code{sort -n} uses what might be considered an unconventional method
2229 to compare strings representing floating point numbers. Rather than
2230 first converting each string to the C @code{double} type and then
2231 comparing those values, @code{sort} aligns the decimal-point characters in the two
2232 strings and compares the strings a character at a time. One benefit
2233 of using this approach is its speed. In practice this is much more
2234 efficient than performing the two corresponding string-to-double (or even
2235 string-to-integer) conversions and then comparing doubles. In addition,
2236 there is no corresponding loss of precision. Converting each string to
2237 @code{double} before comparison would limit precision to about 16 digits
2240 Neither a leading @samp{+} nor exponential notation is recognized.
2241 To compare such strings numerically, use the @samp{-g} option.
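The difference between the two numeric modes shows up as soon as the
input uses exponential notation. In this sketch (illustrative data, C
locale assumed) @samp{-n} reads only the leading @samp{1} of
@samp{1e3}, while @samp{-g} converts the whole string to 1000.

```shell
export LC_ALL=C

# -n stops at the "e": the keys are 1 and 20, so 1e3 sorts first.
plain_numeric=$(printf '1e3\n20\n' | sort -n)

# -g understands the exponent: the keys are 1000 and 20, so 20 sorts first.
general_numeric=$(printf '1e3\n20\n' | sort -g)

printf 'sort -n:\n%s\n' "$plain_numeric"
printf 'sort -g:\n%s\n' "$general_numeric"
```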
2245 @cindex reverse sorting
2246 Reverse the result of comparison, so that lines with greater key values
2247 appear earlier in the output instead of later.
2255 @item -o @var{output-file}
2257 @cindex overwriting of input, allowed
2258 Write output to @var{output-file} instead of standard output.
2259 If @var{output-file} is one of the input files, @code{sort} copies
2260 it to a temporary file before sorting and writing the output to
2265 @cindex size for main memory sorting
2266 Use a main-memory sort buffer of the given @var{size}. By default,
2267 @var{size} is in units of 1,024 bytes. Appending @samp{%} causes
2268 @var{size} to be interpreted as a percentage of physical memory.
2269 Appending @samp{k} multiplies @var{size} by 1,024 (the default),
2270 @samp{M} by 1,048,576, @samp{G} by 1,073,741,824, and so on for
2271 @samp{T}, @samp{P}, @samp{E}, @samp{Z}, and @samp{Y}. Appending
2272 @samp{b} causes @var{size} to be interpreted as a byte count, with no
2275 This option can improve the performance of @command{sort} by causing it
2276 to start with a larger or smaller sort buffer than the default.
2277 However, this option affects only the initial buffer size. The buffer
2278 grows beyond @var{size} if @command{sort} encounters input lines larger
2281 @item -t @var{separator}
2283 @cindex field separator character
2284 Use character @var{separator} as the field separator when finding the
2285 sort keys in each line. By default, fields are separated by the empty
2286 string between a non-whitespace character and a whitespace character.
2287 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
2288 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
2289 not considered to be part of either the field preceding or the field
2290 following. But note that sort fields that extend to the end of the line,
2291 as @samp{-k 2}, or sort fields consisting of a range, as @samp{-k 2,3},
2292 retain the field separators present between the endpoints of the range.
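Because the default field splitting leaves leading blanks inside each
field, the amount of white space before a key can change the result.
The sketch below uses hypothetical input and assumes the C locale, where
a space sorts before any letter. Adding the @samp{b} modifier makes the
comparison start at the first nonblank character instead.

```shell
export LC_ALL=C
input=$(printf 'x  b\ny a\n')

# Field 2 is "  b" on the first line and " a" on the second.
# The extra blank makes "  b" compare lower than " a".
with_blanks=$(printf '%s\n' "$input" | sort -k 2,2)

# With the b modifier the keys are "b" and "a".
without_blanks=$(printf '%s\n' "$input" | sort -k 2b,2)

printf 'default:\n%s\n' "$with_blanks"
printf 'with b:\n%s\n' "$without_blanks"
```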
2294 @item -T @var{tempdir}
2296 @cindex temporary directory
2298 Use directory @var{tempdir} to store temporary files, overriding the
2299 @env{TMPDIR} environment variable. If this option is given more than
2300 once, temporary files are stored in all the directories given. If you
2301 have a large sort or merge that is I/O-bound, you can often improve
2302 performance by using this option to specify directories on different
2303 disks and controllers.
2307 @cindex uniquifying output
2308 For the default case or the @samp{-m} option, only output the first
2309 of a sequence of lines that compare equal. For the @samp{-c} option,
2310 check that no pair of consecutive lines compares equal.
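A short sketch of @samp{-u} on made-up input, assuming the C locale. The
second command combines @samp{-u} with a key, so lines are considered
equal when their first fields match, and one line per distinct key
survives.

```shell
export LC_ALL=C

# Whole-line comparison: the duplicate "b" is dropped.
unique_lines=$(printf 'b\na\nb\n' | sort -u)

# Key comparison: "a 2" and "a 1" compare equal on field 1,
# so only one of them is written.
one_per_key=$(printf 'a 2\na 1\nb 3\n' | sort -u -k 1,1)
key_count=$(printf '%s\n' "$one_per_key" | grep -c .)

printf '%s\n' "$unique_lines"
echo "lines kept per key run: $key_count"
```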
2312 @item -k @var{pos1}[,@var{pos2}]
2315 The recommended, @sc{posix}, option for specifying a sort field. The field
2316 consists of the part of the line between @var{pos1} and @var{pos2} (or the
2317 end of the line, if @var{pos2} is omitted), @emph{inclusive}.
2318 Fields and character positions are numbered starting with 1.
2319 So to sort on the second field, you'd use @samp{-k 2,2}.
2320 See below for more examples.
2324 @cindex sort zero-terminated lines
2325 Treat the input as a set of lines, each terminated by a zero byte (@sc{ascii}
2326 @sc{nul} (Null) character) instead of an @sc{ascii} @sc{lf} (Line Feed).
2327 This option can be useful in conjunction with @samp{perl -0} or
2328 @samp{find -print0} and @samp{xargs -0} which do the same in order to
2329 reliably handle arbitrary pathnames (even those which contain Line Feed
2332 @item +@var{pos1}[-@var{pos2}]
2333 The obsolete, traditional option for specifying a sort field. The field
2334 consists of the part of the line from @var{pos1} up to but @emph{not including}
2335 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
2336 and character positions are numbered starting with 0. See below.
2340 In addition, when @sc{gnu} @code{sort} is invoked with exactly one argument,
2341 options @samp{--help} and @samp{--version} are recognized. @xref{Common
2344 Historical (BSD and System V) implementations of @code{sort} have
2345 differed in their interpretation of some options, particularly
2346 @samp{-b}, @samp{-f}, and @samp{-n}. @sc{gnu} sort follows the @sc{posix}
2347 behavior, which is usually (but not always!) like the System V behavior.
2348 According to @sc{posix}, @samp{-n} no longer implies @samp{-b}. For
2349 consistency, @samp{-M} has been changed in the same way. This may
2350 affect the meaning of character positions in field specifications in
2351 obscure cases. The only fix is to add an explicit @samp{-b}.
2353 A position in a sort field specified with the @samp{-k} or @samp{+}
2354 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
2355 of the field to use and @var{c} is the number of the first character
2356 from the beginning of the field (for @samp{+@var{pos}}) or from the end
2357 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
2358 is omitted, it is taken to be the first character in the field. If the
2359 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
2360 specification is counted from the first nonblank character of the field
2361 (for @samp{+@var{pos}}) or from the first nonblank character following
2362 the previous field (for @samp{-@var{pos}}).
2364 A sort key option may also have any of the option letters @samp{Mbdfinr}
2365 appended to it, in which case the global ordering options are not used
2366 for that particular field. The @samp{-b} option may be independently
2367 attached to either or both of the @samp{+@var{pos}} and
2368 @samp{-@var{pos}} parts of a field specification, and if it is inherited
2369 from the global options it will be attached to both.
2370 Keys may span multiple fields.
2372 Here are some examples to illustrate various combinations of options.
2373 In them, the @sc{posix} @samp{-k} option is used to specify sort keys rather
2374 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
2379 Sort in descending (reverse) numeric order.
2386 Sort alphabetically, omitting the first and second fields.
2387 This uses a single key composed of the characters beginning
2388 at the start of field three and extending to the end of each line.
2395 Sort numerically on the second field and resolve ties by sorting
2396 alphabetically on the third and fourth characters of field five.
2397 Use @samp{:} as the field delimiter.
2400 sort -t : -k 2,2n -k 5.3,5.4
2403 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2}
2404 @command{sort} would have used all characters beginning in the second field
2405 and extending to the end of the line as the primary @emph{numeric}
2406 key. For the large majority of applications, treating keys spanning
2407 more than one field as numeric will not do what you expect.
2409 Also note that the @samp{n} modifier was applied to the field-end
2410 specifier for the first key. It would have been equivalent to
2411 specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
2412 @samp{b} apply to the associated @emph{field}, regardless of whether
2413 the modifier character is attached to the field-start and/or the
2414 field-end part of the key specifier.
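To make the two-key command above concrete, here is a sketch that runs
it on three hypothetical colon-separated records (C locale assumed). The
second field decides the order first; the two records that tie on
@samp{10} are then ordered by characters three and four of field five.

```shell
export LC_ALL=C

# Hypothetical records: field 2 is numeric, field 5 ends in "bb" or "aa".
records=$(printf 'x:10:a:b:zzbb\ny:2:a:b:zzaa\nz:10:a:b:zzaa\n')

ordered=$(printf '%s\n' "$records" | sort -t : -k 2,2n -k 5.3,5.4)

printf '%s\n' "$ordered"
```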
2417 Sort the password file on the fifth field and ignore any
2418 leading white space. Sort lines with equal values in field five
2419 on the numeric user ID in field three.
2422 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2425 An alternative is to use the global numeric modifier @samp{-n}.
2428 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
2432 Generate a tags file in case-insensitive sorted order.
2435 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
2438 The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
2439 that pathnames that contain Line Feed characters will not get broken up
2440 by the sort operation.
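The same idea can be tried without any files. In this sketch each record
is terminated by a zero byte and contains an embedded Line Feed; the
final @code{tr} only makes the result printable by turning zero bytes
into @samp{;} and Line Feeds into @samp{,}. The data is made up, and the
C locale is assumed.

```shell
export LC_ALL=C

# Two records, "b<LF>x" and "a<LF>y", each terminated by a zero byte.
# sort -z keeps each record intact and orders them as whole units.
joined=$(printf 'b\nx\0a\ny\0' | sort -z | tr '\0\n' ';,')

printf '%s\n' "$joined"
```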
2442 Finally, to ignore both leading and trailing white space, you
2443 could have applied the @samp{b} modifier to the field-end specifier
2447 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
2450 or used the global @samp{-b} modifier instead of @samp{-n}
2451 and an explicit @samp{n} with the second key specifier.
2454 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
2457 @c This example is a bit contrived and needs more explanation.
2459 @c Sort records separated by an arbitrary string by using a pipe to convert
2460 @c each record delimiter string to @samp{\0}, then using sort's -z option,
2461 @c and converting each @samp{\0} back to the original record delimiter.
2464 @c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 's/\0/\n/g'
2470 @node uniq invocation
2471 @section @code{uniq}: Uniquify files
2474 @cindex uniquify files
2476 @code{uniq} writes the unique lines in the given @var{input}, or
2477 standard input if nothing is given or for an @var{input} name of
2481 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2484 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2485 discards all but one of identical successive lines. Optionally, it can
2486 instead show only lines that appear exactly once, or lines that appear
2489 The input must be sorted. If your input is not sorted, perhaps you want
2490 to use @code{sort -u}.
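The reason sorting matters is that @code{uniq} only compares successive
lines. A minimal sketch with made-up input (C locale assumed): the two
@samp{a} lines are not adjacent, so plain @code{uniq} keeps both.

```shell
export LC_ALL=C

# The duplicates are separated by "b", so uniq does not notice them.
unsorted_result=$(printf 'a\nb\na\n' | uniq)

# Sorting first brings the duplicates together.
sorted_result=$(printf 'a\nb\na\n' | sort | uniq)

printf 'uniq alone:\n%s\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n' "$sorted_result"
```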
2492 If no @var{output} file is specified, @code{uniq} writes to standard
2495 The program accepts the following options. Also see @ref{Common options}.
2501 @itemx --skip-fields=@var{n}
2504 @opindex --skip-fields
2505 Skip @var{n} fields on each line before checking for uniqueness. Fields
2506 are sequences of non-space non-tab characters that are separated from
2507 each other by at least one space or tab.
2511 @itemx --skip-chars=@var{n}
2514 @opindex --skip-chars
2515 Skip @var{n} characters before checking for uniqueness. If you use both
2516 the field and character skipping options, fields are skipped over first.
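A small sketch of both skipping options, using their short forms
@samp{-f} and @samp{-s} on illustrative input. In each case the first
line of a run of lines that match after skipping is the one printed.

```shell
export LC_ALL=C

# Skip one field: "1 x" and "2 x" compare equal, "3 y" differs.
skip_field=$(printf '1 x\n2 x\n3 y\n' | uniq -f 1)

# Skip one character: "ax" and "bx" compare equal, "cy" differs.
skip_char=$(printf 'ax\nbx\ncy\n' | uniq -s 1)

printf '%s\n' "$skip_field"
printf '%s\n' "$skip_char"
```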
2522 Print the number of times each line occurred along with the line.
2525 @itemx --ignore-case
2527 @opindex --ignore-case
2528 Ignore differences in case when comparing lines.
2534 @cindex duplicate lines, outputting
2535 Print only duplicate lines.
2538 @itemx --all-repeated
2540 @opindex --all-repeated
2541 @cindex all duplicate lines, outputting
2542 Print all duplicate lines and only duplicate lines.
2543 This option is useful mainly in conjunction with other options, e.g.,
2544 to ignore case or to compare only selected fields.
2545 This is a @sc{gnu} extension.
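One way this option can be useful, sketched on hypothetical data: when
the first field is a serial number and the rest is the payload,
combining @samp{--all-repeated} with field skipping lists every record
whose payload occurs more than once, serial numbers included.

```shell
export LC_ALL=C

# Records 1 and 2 share the payload "x"; record 3 is alone.
repeated=$(printf '1 x\n2 x\n3 y\n' | uniq --all-repeated -f 1)

printf '%s\n' "$repeated"
```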
2546 @c FIXME: give an example showing *how* it's useful
2552 @cindex unique lines, outputting
2553 Print only unique lines.
2556 @itemx --check-chars=@var{n}
2558 @opindex --check-chars
2559 Compare @var{n} characters on each line (after skipping any specified
2560 fields and characters). By default the entire rest of the lines are
2566 @node comm invocation
2567 @section @code{comm}: Compare two sorted files line by line
2570 @cindex line-by-line comparison
2571 @cindex comparing sorted files
2573 @code{comm} writes to standard output lines that are common, and lines
2574 that are unique, to two input files; a file name of @samp{-} means
2575 standard input. Synopsis:
2578 comm [@var{option}]@dots{} @var{file1} @var{file2}
2582 Before @code{comm} can be used, the input files must be sorted using the
2583 collating sequence specified by the @env{LC_COLLATE} locale.
2584 If an input file ends in a non-newline
2585 character, a newline is silently appended. The @code{sort} command with
2586 no options always outputs a file that is suitable input to @code{comm}.
2588 @cindex differing lines
2589 @cindex common lines
2590 With no options, @code{comm} produces three-column output. Column one
2591 contains lines unique to @var{file1}, column two contains lines unique
2592 to @var{file2}, and column three contains lines common to both files.
2593 Columns are separated by a single TAB character.
2594 @c FIXME: when there's an option to supply an alternative separator
2595 @c string, append `by default' to the above sentence.
2600 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2601 the corresponding columns. Also see @ref{Common options}.
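The column-suppressing options are usually combined. The following
sketch writes two small sorted files into a scratch directory (names and
contents are hypothetical, C locale assumed) and extracts the common
lines and the lines found only in the first file.

```shell
export LC_ALL=C
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/file1"
printf 'b\nc\nd\n' > "$workdir/file2"

# Suppress columns 1 and 2: only lines common to both files remain.
common=$(comm -12 "$workdir/file1" "$workdir/file2")

# Suppress columns 2 and 3: only lines unique to file1 remain.
only_first=$(comm -23 "$workdir/file1" "$workdir/file2")

printf 'common:\n%s\n' "$common"
printf 'only in file1:\n%s\n' "$only_first"
rm -rf "$workdir"
```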
2603 Unlike some other comparison utilities, @code{comm} has an exit
2604 status that does not depend on the result of the comparison.
2605 Upon normal completion @code{comm} produces an exit code of zero.
2606 If there is an error it exits with nonzero status.
2609 @node tsort invocation
2610 @section @code{tsort}: Topological sort
2613 @cindex topological sort
2615 @code{tsort} performs a topological sort on the given @var{file}, or
2616 standard input if no input file is given or for a @var{file} of
2620 tsort [@var{option}] [@var{file}]
2623 @code{tsort} reads its input as pairs of strings, separated by blanks,
2624 indicating a partial ordering. The output is a total ordering that
2625 corresponds to the given partial ordering.
2639 will produce the output
2650 @code{tsort} detects cycles in the input and writes the first cycle
2651 encountered to standard error.
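Both behaviors can be sketched with tiny made-up inputs. The first input
is a simple chain, which admits exactly one total ordering; the second
contains a cycle, so a diagnostic is expected on standard error. The
exact wording of that diagnostic is not assumed here.

```shell
export LC_ALL=C

# "a before b" and "b before c" leave only one possible ordering.
chain_order=$(printf 'a b\nb c\n' | tsort)

# "a before b" and "b before a" form a cycle; capture only standard error.
cycle_errors=$(printf 'a b\nb a\n' | tsort 2>&1 >/dev/null) || true

printf '%s\n' "$chain_order"
printf 'diagnostic: %s\n' "$cycle_errors"
```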
2653 Note that for a given partial ordering, generally there is no unique
2656 The only options are @samp{--help} and @samp{--version}. @xref{Common
2660 @node ptx invocation
2661 @section @code{ptx}: Produce permuted indexes
2665 @code{ptx} reads a text file and essentially produces a permuted index, with
2666 each keyword in its context. The calling sketch is one of:
2669 ptx [@var{option} @dots{}] [@var{file} @dots{}]
2670 ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
2673 The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
2674 all @sc{gnu} extensions and reverts to traditional mode, thus introducing some
2675 limitations and changing several of the program's default option values.
2676 When @samp{-G} is not specified, @sc{gnu} extensions are always enabled.
2677 @sc{gnu} extensions to @code{ptx} are documented wherever appropriate in this
2678 document. For the full list, see @ref{Compatibility in ptx}.
2680 Individual options are explained in the following sections.
2682 When @sc{gnu} extensions are enabled, there may be zero, one or several
2683 @var{file}s after the options. If there is no @var{file}, the program
2684 reads the standard input. If there are one or more @var{file}s, they
2685 name the input files, which are all read in turn, as if all the
2686 input files were concatenated. However, there is a full contextual
2687 break between each file and, when automatic referencing is requested,
2688 file names and line numbers refer to individual text input files. In
2689 all cases, the program outputs the permuted index to the standard
2692 When @sc{gnu} extensions are @emph{not} enabled, that is, when the program
2693 operates in traditional mode, there may be zero, one or two parameters
2694 besides the options. If there are no parameters, the program reads the
2695 standard input and outputs the permuted index to the standard output.
2696 If there is only one parameter, it names the text @var{input} to be read
2697 instead of the standard input. If two parameters are given, they give
2698 respectively the name of the @var{input} file to read and the name of
2699 the @var{output} file to produce. @emph{Be very careful} to note that,
2700 in this case, the contents of the file given by the second parameter are
2701 destroyed. This behavior is dictated by System V @code{ptx}
2702 compatibility; @sc{gnu} Standards normally discourage output parameters not
2703 introduced by an option.
2705 Note that for @emph{any} file named as the value of an option or as an
2706 input text file, a single dash @kbd{-} may be used, in which case
2707 standard input is assumed. However, it would not make sense to use this
2708 convention more than once per program invocation.
2711 * General options in ptx:: Options which affect general program behavior.
2712 * Charset selection in ptx:: Underlying character set considerations.
2713 * Input processing in ptx:: Input fields, contexts, and keyword selection.
2714 * Output formatting in ptx:: Types of output format, and sizing the fields.
2715 * Compatibility in ptx::
2719 @node General options in ptx
2720 @subsection General options
2726 Print a short note about the copyright and copying conditions, then
2727 exit without further processing.
2730 @itemx --traditional
2731 As already explained, this option disables all @sc{gnu} extensions to
2732 @code{ptx} and switches to traditional mode.
2735 Print a short help on standard output, then exit without further
2739 Print the program version on standard output, then exit without further
2745 @node Charset selection in ptx
2746 @subsection Charset selection
2748 @c FIXME: People don't necessarily know what an IBM-PC was these days.
2749 As it is set up now, the program assumes that the input file is coded
2750 using 8-bit ISO 8859-1 code, also known as Latin-1 character set,
2751 @emph{unless} it is compiled for MS-DOS, in which case it uses the
2752 character set of the IBM-PC. (@sc{gnu} @code{ptx} is not known to work on
2753 smaller MS-DOS machines anymore.) Compared to 7-bit @sc{ascii}, the set
2754 of characters which are letters is different; this alters the behavior
2755 of regular expression matching. Thus, the default regular expression
2756 for a keyword allows foreign or diacriticized letters. Keyword sorting,
2757 however, is still crude; it obeys the underlying character set ordering
2763 @itemx --ignore-case
2764 Fold lower case letters to upper case for sorting.
2769 @node Input processing in ptx
2770 @subsection Word selection and input processing
2775 @item --break-file=@var{file}
2777 This option provides an alternative (to @samp{-W}) method of describing
2778 which characters make up words. It introduces the name of a
2779 file which contains a list of characters which can@emph{not} be part of
2780 one word; this file is called the @dfn{Break file}. Any character which
2781 is not part of the Break file is a word constituent. If both options
2782 @samp{-b} and @samp{-W} are specified, then @samp{-W} has precedence and
2783 @samp{-b} is ignored.
2785 When @sc{gnu} extensions are enabled, the only way to avoid newline as a
2786 break character is to write all the break characters in the file with no
2787 newline at all, not even at the end of the file. When @sc{gnu} extensions
2788 are disabled, spaces, tabs and newlines are always considered as break
2789 characters even if not included in the Break file.
2792 @itemx --ignore-file=@var{file}
2794 The file associated with this option contains a list of words which will
2795 never be taken as keywords in concordance output. It is called the
2796 @dfn{Ignore file}. The file contains exactly one word in each line; the
2797 end of line separation of words is not subject to the value of the
2800 There is a default Ignore file used by @code{ptx} when this option is
2801 not specified, usually found in @file{/usr/local/lib/eign} if this has
2802 not been changed at installation time. If you want to deactivate the
2803 default Ignore file, specify @code{/dev/null} instead.
2806 @itemx --only-file=@var{file}
2808 The file associated with this option contains a list of words which will
2809 be retained in concordance output; any word not mentioned in this file
2810 is ignored. The file is called the @dfn{Only file}. The file contains
2811 exactly one word in each line; the end of line separation of words is
2812 not subject to the value of the @samp{-S} option.
2814 There is no default for the Only file. When both an Only file and an
2815 Ignore file are specified, a word is considered a keyword only
2816 if it is listed in the Only file and not in the Ignore file.
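As a rough sketch of an Only file in use, the following restricts the
index to a single keyword. The scratch directory, the file name
@file{only.txt} and the sample sentence are hypothetical; the layout of
the output line is not assumed, only that one entry is produced for the
one retained keyword.

```shell
export LC_ALL=C
workdir=$(mktemp -d)

# The Only file lists one word per line; here just "fox".
printf 'fox\n' > "$workdir/only.txt"

# Index a one-line text read from standard input.
index=$(printf 'the quick brown fox\n' | ptx --only-file="$workdir/only.txt")

# Count the non-empty index lines.
entry_count=$(printf '%s\n' "$index" | grep -c .)

printf '%s\n' "$index"
echo "entries: $entry_count"
rm -rf "$workdir"
```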
2821 On each input line, the leading sequence of non-white space characters will be
2822 taken to be a reference that has the purpose of identifying this input
2823 line in the resulting permuted index. For more information about reference
2824 production, see @ref{Output formatting in ptx}.
2825 Using this option changes the default value for option @samp{-S}.
2827 Using this option, the program does not try very hard to remove
2828 references from contexts in output, but it succeeds in doing so
2829 @emph{when} the context ends exactly at the newline. If option
2830 @samp{-r} is used with @samp{-S} default value, or when @sc{gnu} extensions
2831 are disabled, this condition is always met and references are completely
2832 excluded from the output contexts.
2834 @item -S @var{regexp}
2835 @itemx --sentence-regexp=@var{regexp}
2837 This option selects which regular expression will describe the end of a
2838 line or the end of a sentence. In fact, this regular expression is not
2839 the only distinction between end of lines or end of sentences, and input
2840 line boundaries have no special significance outside this option. By
2841 default, when @sc{gnu} extensions are enabled and if @samp{-r} option is not
2842 used, end of sentences are used. In this case, the default @var{regexp} is
2843 imported from @sc{gnu} Emacs:
2846 [.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
2849 Whenever @sc{gnu} extensions are disabled or if @samp{-r} option is used, end
2850 of lines are used; in this case, the default @var{regexp} is just:
2856 Using an empty @var{regexp} is equivalent to completely disabling end of
2857 line or end of sentence recognition. In this case, the whole file is
2858 considered to be a single big line or sentence. The user might want to
2859 disallow all truncation flag generation as well, through option @samp{-F
2860 ""}. @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2863 When the keywords happen to be near the beginning of the input line or
2864 sentence, this often creates an unused area at the beginning of the
2865 output context line; when the keywords happen to be near the end of the
2866 input line or sentence, this often creates an unused area at the end of
2867 the output context line. The program tries to fill those unused areas
2868 by wrapping around context in them; the tail of the input line or
2869 sentence is used to fill the unused area on the left of the output line;
2870 the head of the input line or sentence is used to fill the unused area
2871 on the right of the output line.
2873 As a matter of convenience to the user, many usual backslashed escape
2874 sequences from the C language are recognized and converted to the
2875 corresponding characters by @code{ptx} itself.
2877 @item -W @var{regexp}
2878 @itemx --word-regexp=@var{regexp}
2880 This option selects which regular expression will describe each keyword.
2881 By default, if @sc{gnu} extensions are enabled, a word is a sequence of
2882 letters; the @var{regexp} used is @samp{\w+}. When @sc{gnu} extensions are
2883 disabled, a word is by default anything which ends with a space, a tab
2884 or a newline; the @var{regexp} used is @samp{[^ \t\n]+}.
2886 An empty @var{regexp} is equivalent to not using this option.
2887 @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2890 As a matter of convenience to the user, many usual backslashed escape
2891 sequences, as found in the C language, are recognized and converted to
2892 the corresponding characters by @code{ptx} itself.
2897 @node Output formatting in ptx
2898 @subsection Output formatting
2900 Output format is mainly controlled by the @samp{-O} and @samp{-T} options
2901 described in the table below. When neither @samp{-O} nor @samp{-T} is
2902 selected, and if @sc{gnu} extensions are enabled, the program chooses an
2903 output format suitable for a dumb terminal. Each keyword occurrence is
2904 output to the center of one line, surrounded by its left and right
2905 contexts. Each field is properly justified, so the concordance output
2906 can be readily observed. As a special feature, if automatic
2907 references are selected by option @samp{-A} and are output before the
2908 left context, that is, if option @samp{-R} is @emph{not} selected, then
2909 a colon is added after the reference; this nicely interfaces with @sc{gnu}
2910 Emacs @code{next-error} processing. In this default output format, each
2911 white space character, like newline and tab, is merely changed to
2912 exactly one space, with no special attempt to compress consecutive
2913 spaces. This might change in the future. Except for those white space
2914 characters, every other character of the underlying set of 256
2915 characters is transmitted verbatim.
2917 Output format is further controlled by the following options.
2921 @item -g @var{number}
2922 @itemx --gap-size=@var{number}
2924 Select the size of the minimum white space gap between the fields on the
2927 @item -w @var{number}
2928 @itemx --width=@var{number}
2930 Select the maximum output width of each final line. If references are
2931 used, they are included or excluded from the maximum output width
2932 depending on the value of option @samp{-R}. If this option is not
2933 selected, that is, when references are output before the left context,
2934 the maximum output width takes into account the maximum length of all
2935 references. If this option is selected, that is, when references are
2936 output after the right context, the maximum output width does not take
2937 into account the space taken by references, nor the gap that precedes
2941 @itemx --auto-reference
2943 Select automatic references. Each input line will have an automatic
2944 reference made up of the file name and the line ordinal, with a single
2945 colon between them. However, the file name will be empty when standard
2946 input is being read. If both @samp{-A} and @samp{-r} are selected, then
2947 the input reference is still read and skipped, but the automatic
2948 reference is used at output time, overriding the input reference.
2951 @itemx --right-side-refs
2953 In the default output format, when option @samp{-R} is not used, any
2954 references produced by the effect of options @samp{-r} or @samp{-A} are
2955 placed at the beginning of each output line, before the left context.
2956 With the default output format, when the @samp{-R} option is specified,
2957 references are instead placed to the far right of output lines, after
2958 the right context. For any other output format, option @samp{-R} is
2959 ignored, with one exception: with @samp{-R} the width of references
2960 is @emph{not} taken into account in total output width given by @samp{-w}.
2962 This option is automatically selected whenever @sc{gnu} extensions are
2965 @item -F @var{string}
2966 @itemx --flag-truncation=@var{string}
2968 This option will request that any truncation in the output be reported
2969 using the string @var{string}. Most output fields theoretically extend
2970 towards the beginning or the end of the current line, or current
2971 sentence, as selected with option @samp{-S}. But there is a maximum
2972 allowed output line width, changeable through option @samp{-w}, which is
2973 further divided into space for various output fields. When a field
2974 cannot reach the beginning or the end of the current line because it
2975 must fit within its allotted space, a truncation occurs. By default,
2976 the string used is a single slash, as in @samp{-F /}.
2978 @var{string} may have more than one character, as in @samp{-F ...}.
2979 Also, in the particular case when @var{string} is empty (@samp{-F ""}),
2980 truncation flagging is disabled, and no truncation marks are appended in
2983 As a matter of convenience to the user, many usual backslashed escape
2984 sequences, as found in the C language, are recognized and converted to
2985 the corresponding characters by @code{ptx} itself.
2987 @item -M @var{string}
2988 @itemx --macro-name=@var{string}
2990 Select another @var{string} to be used instead of @samp{xx}, while
2991 generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
2994 @itemx --format=roff
2996 Choose an output format suitable for @code{nroff} or @code{troff}
2997 processing. Each output line will look like:
3000 .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
3003 so it will be possible to write a @samp{.xx} roff macro to take care of
3004 the output typesetting. This is the default output format when @sc{gnu}
3005 extensions are disabled. Option @samp{-M} can be used to change
3006 @samp{xx} to another macro name.
3008 In this output format, each non-graphical character, like newline and
3009 tab, is merely changed to exactly one space, with no special attempt to
3010 compress consecutive spaces. Each quote character: @kbd{"} is doubled
3011 so it will be correctly processed by @code{nroff} or @code{troff}.
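As a rough illustration of the @samp{-O} format, the following sketch
feeds a single line to @code{ptx} and counts the macro calls produced.
It assumes that a @sc{gnu} @code{ptx} is installed; the sample text and
the replacement macro name @samp{IX} are invented for the example.

```shell
# The input holds two keywords, so two macro calls are expected.
roff_lines=$(printf 'hello world\n' | ptx -O | grep -c '^\.xx ')

# -M swaps the macro name; IX is an arbitrary, hypothetical choice.
renamed_lines=$(printf 'hello world\n' | ptx -O -M IX | grep -c '^\.IX ')

echo "$roff_lines $renamed_lines"
```

Counting lines rather than comparing them verbatim keeps the sketch
independent of how the context fields happen to be split.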
3016 Choose an output format suitable for @TeX{} processing. Each output
3017 line will look like:
3020 \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
3024 so it will be possible to write a @code{\xx} definition to take care of
3025 the output typesetting. Note that when references are not being
3026 produced, that is, neither option @samp{-A} nor option @samp{-r} is
3027 selected, the last parameter of each @code{\xx} call is inhibited.
3028 Option @samp{-M} can be used to change @samp{xx} to another macro
3031 In this output format, some special characters, like @kbd{$}, @kbd{%},
3032 @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
3033 backslash. Curly brackets @kbd{@{}, @kbd{@}} are protected with a
3034 backslash and a pair of dollar signs (to force mathematical mode). The
3035 backslash itself produces the sequence @code{\backslash@{@}}.
3036 Circumflex and tilde diacritics produce the sequence @code{^\@{ @}} and
3037 @code{~\@{ @}} respectively. Other diacriticized characters of the
3038 underlying character set produce an appropriate @TeX{} sequence as far
3039 as possible. The other non-graphical characters, like newline and tab,
3040 and all other characters which are not part of @sc{ascii}, are merely
3041 changed to exactly one space, with no special attempt to compress
3042 consecutive spaces. Let me know how to improve this special character
3043 processing for @TeX{}.
3048 @node Compatibility in ptx
3049 @subsection The @sc{gnu} extensions to @code{ptx}
3051 This version of @code{ptx} contains a few features which do not exist in
3052 System V @code{ptx}. These extra features are suppressed by using the
3053 @samp{-G} command line option, unless overridden by other command line
3054 options. Some @sc{gnu} extensions cannot be recovered by overriding, so the
3055 simple rule is to avoid @samp{-G} if you care about @sc{gnu} extensions.
3056 Here are the differences between this program and System V @code{ptx}.
3061 This program can read many input files at once, and it always writes the
3062 resulting concordance on standard output. On the other hand, System V
3063 @code{ptx} reads only one file and sends the result to standard output
3064 or, if a second @var{file} parameter is given on the command line, to that
3067 Having output parameters not introduced by options is a dangerous
3068 practice which @sc{gnu} avoids as far as possible. So, for using @code{ptx}
3069 portably between @sc{gnu} and System V, you should always use it with a
3070 single input file, and always expect the result on standard output. You
3071 might also want to automatically configure in a @samp{-G} option to
3072 @code{ptx} calls in products using @code{ptx}, if the configurator finds
3073 that the installed @code{ptx} accepts @samp{-G}.
3076 The only options available in System V @code{ptx} are options @samp{-b},
3077 @samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
3078 @samp{-w}. All other options are @sc{gnu} extensions and are not repeated in
3079 this enumeration. Moreover, some options have a slightly different
3080 meaning when @sc{gnu} extensions are enabled, as explained below.
3083 By default, concordance output is not formatted for @code{troff} or
3084 @code{nroff}. It is rather formatted for a dumb terminal. @code{troff}
3085 or @code{nroff} output may still be selected through option @samp{-O}.
3088 Unless @samp{-R} option is used, the maximum reference width is
3089 subtracted from the total output line width. With @sc{gnu} extensions
3090 disabled, width of references is not taken into account in the output
3091 line width computations.
3094 All 256 characters, even @kbd{NUL}s, are always read and processed from
3095 input file with no adverse effect, even if @sc{gnu} extensions are disabled.
3096 However, System V @code{ptx} does not accept 8-bit characters, a few
3097 control characters are rejected, and the tilde @kbd{~} is also rejected.
3100 Input line length is only limited by available memory, even if @sc{gnu}
3101 extensions are disabled. However, System V @code{ptx} processes only
3102 the first 200 characters in each line.
3105 The break (non-word) characters default to every character except the
3106 letters of the underlying character set, diacriticized or not. When @sc{gnu}
3107 extensions are disabled, the break characters default to space, tab and
3111 The program makes better use of output line width. If @sc{gnu} extensions
3112 are disabled, the program rather tries to imitate System V @code{ptx},
3113 but still, there are some slight disposition glitches this program does
3114 not completely reproduce.
3117 The user can specify both an Ignore file and an Only file. This is not
3118 allowed with System V @code{ptx}.
3123 @node Operating on fields within a line
3124 @chapter Operating on fields within a line
3127 * cut invocation:: Print selected parts of lines.
3128 * paste invocation:: Merge lines of files.
3129 * join invocation:: Join lines on a common field.
3133 @node cut invocation
3134 @section @code{cut}: Print selected parts of lines
3137 @code{cut} writes to standard output selected parts of each line of each
3138 input file, or standard input if no files are given or for a file name of
3142 cut [@var{option}]@dots{} [@var{file}]@dots{}
3145 In the table which follows, the @var{byte-list}, @var{character-list},
3146 and @var{field-list} are one or more numbers or ranges (two numbers
3147 separated by a dash) separated by commas. Bytes, characters, and
3148 fields are numbered starting at 1. Incomplete ranges may be
3149 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
3150 @samp{@var{n}} through end of line or last field.
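The list syntax can be tried on a single line of input. This is only
a sketch: the sample line and the variable names are made up, and the
@samp{-d} option used here to select a colon as the field delimiter is
described in the table below.

```shell
line='alpha:beta:gamma:delta'

# "-2" is shorthand for "1-2".
first_two=$(printf '%s\n' "$line" | cut -d: -f-2)

# "3-" runs from field 3 through the last field.
from_third=$(printf '%s\n' "$line" | cut -d: -f3-)

# Single numbers and ranges may be mixed, separated by commas.
picked=$(printf '%s\n' "$line" | cut -d: -f1,3-4)

# Byte positions use the same list syntax.
first_bytes=$(printf '%s\n' "$line" | cut -b1-5)

printf '%s\n' "$first_two" "$from_third" "$picked" "$first_bytes"
```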
3152 The program accepts the following options. Also see @ref{Common
3157 @item -b @var{byte-list}
3158 @itemx --bytes=@var{byte-list}
3161 Print only the bytes in positions listed in @var{byte-list}. Tabs and
3162 backspaces are treated like any other character; they take up 1 byte.
3164 @item -c @var{character-list}
3165 @itemx --characters=@var{character-list}
3167 @opindex --characters
3168 Print only characters in positions listed in @var{character-list}.
3169 The same as @samp{-b} for now, but internationalization will change
3170 that. Tabs and backspaces are treated like any other character; they
3171 take up 1 character.
3173 @item -f @var{field-list}
3174 @itemx --fields=@var{field-list}
3177 Print only the fields listed in @var{field-list}. Fields are
3178 separated by a TAB character by default.
3179 Also print any line that contains no delimiter character, unless
3180 the @samp{--only-delimited} (@samp{-s}) option is specified
3182 @item -d @var{input_delim_byte}
3183 @itemx --delimiter=@var{input_delim_byte}
3185 @opindex --delimiter
3186 For @samp{-f}, fields are separated in the input by the first character
3187 in @var{input_delim_byte} (default is TAB).
3191 Do not split multi-byte characters (no-op for now).
3194 @itemx --only-delimited
3196 @opindex --only-delimited
3197 For @samp{-f}, do not print lines that do not contain the field separator
3200 @itemx --output-delimiter=@var{output_delim_string}
3201 @opindex --output-delimiter
3202 For @samp{-f}, output fields are separated by @var{output_delim_string}.
3203 The default is to use the input delimiter.
3209 @node paste invocation
3210 @section @code{paste}: Merge lines of files
3213 @cindex merging files
3215 @code{paste} writes to standard output lines consisting of sequentially
3216 corresponding lines of each given file, separated by a TAB character.
3217 Standard input is used for a file name of @samp{-} or if no input files
3223 paste [@var{option}]@dots{} [@var{file}]@dots{}
3226 The program accepts the following options. Also see @ref{Common options}.
3234 Paste the lines of one file at a time rather than one line from each
3237 @item -d @var{delim-list}
3238 @itemx --delimiters=@var{delim-list}
3240 @opindex --delimiters
3241 Consecutively use the characters in @var{delim-list} instead of
3242 TAB to separate merged lines. When @var{delim-list} is
3243 exhausted, start again at its beginning.
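A short sketch of how the delimiter list is consumed, using three
throwaway files created just for the example (the file names and their
contents are invented):

```shell
tmp=$(mktemp -d)
printf '1\n2\n3\n' > "$tmp/nums"
printf 'a\nb\nc\n' > "$tmp/letters"
printf 'x\ny\nz\n' > "$tmp/more"

# The comma joins columns 1 and 2, the colon joins columns 2 and 3.
merged=$(paste -d ',:' "$tmp/nums" "$tmp/letters" "$tmp/more")

# With -s, the lines of one file are pasted onto a single line.
serial=$(paste -s -d ',' "$tmp/nums")

rm -r "$tmp"
printf '%s\n' "$merged" "$serial"
```

Here the first merged line is @samp{1,a:x}, and the serial form is
@samp{1,2,3}.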
3248 @node join invocation
3249 @section @code{join}: Join lines on a common field
3252 @cindex common field, joining on
3254 @code{join} writes to standard output a line for each pair of input
3255 lines that have identical join fields. Synopsis:
3258 join [@var{option}]@dots{} @var{file1} @var{file2}
3262 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
3263 meaning standard input. @var{file1} and @var{file2} should be already
3264 sorted in increasing textual order on the join fields, using the
3265 collating sequence specified by the @env{LC_COLLATE} locale. Unless
3266 the @samp{-t} option is given, the input should be sorted ignoring blanks at
3267 the start of the join field, as in @code{sort -b}. If the
3268 @samp{--ignore-case} option is given, lines should be sorted without
3269 regard to the case of characters in the join field, as in @code{sort -f}.
3271 The defaults are: the join field is the first field in each line;
3272 fields in the input are separated by one or more blanks, with leading
3273 blanks on the line ignored; fields in the output are separated by a
3274 space; each output line consists of the join field, the remaining
3275 fields from @var{file1}, then the remaining fields from @var{file2}.
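The defaults can be seen with two small, already sorted files. This is
a minimal sketch; the file names and contents are invented for the
example.

```shell
tmp=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmp/f1"
printf 'a x\nc y\nd z\n' > "$tmp/f2"

# Only keys present in both files (a and c) produce output lines:
# the join field, then the rest of file 1, then the rest of file 2.
joined=$(join "$tmp/f1" "$tmp/f2")

rm -r "$tmp"
printf '%s\n' "$joined"
```

The result is the two lines @samp{a 1 x} and @samp{c 3 y}; the
unpairable lines for @samp{b} and @samp{d} are dropped.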
3277 The program accepts the following options. Also see @ref{Common options}.
3281 @item -a @var{file-number}
3283 Print a line for each unpairable line in file @var{file-number} (either
3284 @samp{1} or @samp{2}), in addition to the normal output.
3286 @item -e @var{string}
3288 Replace those output fields that are missing in the input with
3292 @itemx --ignore-case
3294 @opindex --ignore-case
3295 Ignore differences in case when comparing keys.
3296 With this option, the lines of the input files must be ordered in the same way.
3297 Use @samp{sort -f} to produce this ordering.
3299 @item -1 @var{field}
3300 @itemx -j1 @var{field}
3303 Join on field @var{field} (a positive integer) of file 1.
3305 @item -2 @var{field}
3306 @itemx -j2 @var{field}
3309 Join on field @var{field} (a positive integer) of file 2.
3311 @item -j @var{field}
3312 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
3314 @item -o @var{field-list}@dots{}
3315 Construct each output line according to the format in @var{field-list}.
3316 Each element in @var{field-list} is either the single character @samp{0} or
3317 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
3318 @samp{2} and @var{n} is a positive field number.
3320 A field specification of @samp{0} denotes the join field.
3321 In most cases, the functionality of the @samp{0} field spec
3322 may be reproduced using the explicit @var{m.n} that corresponds
3323 to the join field. However, when printing unpairable lines
3324 (using either of the @samp{-a} or @samp{-v} options), there is no way
3325 to specify the join field using @var{m.n} in @var{field-list}
3326 if there are unpairable lines in both files.
3327 To give @code{join} that functionality, @sc{posix} invented the @samp{0}
3328 field specification notation.
3330 The elements in @var{field-list}
3331 are separated by commas or blanks. Multiple @var{field-list}
3332 arguments can be given after a single @samp{-o} option; the values
3333 of all lists given with @samp{-o} are concatenated together.
3334 All output lines, including those printed because of any @samp{-a} or
3335 @samp{-v} option, are subject to the specified @var{field-list}.
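The following sketch shows why the @samp{0} field specification is
useful when unpairable lines from both files are printed. The file
names, their contents, and the filler string @samp{NONE} are all
invented for the example.

```shell
tmp=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmp/f1"
printf 'a x\nc y\nd z\n' > "$tmp/f2"

# 0 prints the join field whichever file the line came from;
# -e supplies the text for fields that are missing in the input.
full=$(join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$tmp/f1" "$tmp/f2")

rm -r "$tmp"
printf '%s\n' "$full"
```

Writing @samp{1.1} in place of @samp{0} would leave the key of the
@samp{d} line, which exists only in the second file, to be filled in by
@samp{-e}.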
3338 Use character @var{char} as the input and output field separator.
3340 @item -v @var{file-number}
3341 Print a line for each unpairable line in file @var{file-number}
3342 (either @samp{1} or @samp{2}), instead of the normal output.
3346 In addition, when @sc{gnu} @code{join} is invoked with exactly one argument,
3347 options @samp{--help} and @samp{--version} are recognized. @xref{Common
3351 @node Operating on characters
3352 @chapter Operating on characters
3354 @cindex operating on characters
3356 These commands operate on individual characters.
3359 * tr invocation:: Translate, squeeze, and/or delete characters.
3360 * expand invocation:: Convert tabs to spaces.
3361 * unexpand invocation:: Convert spaces to tabs.
3366 @section @code{tr}: Translate, squeeze, and/or delete characters
3373 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
3376 @code{tr} copies standard input to standard output, performing
3377 one of the following operations:
3381 translate, and optionally squeeze repeated characters in the result,
3383 squeeze repeated characters,
3387 delete characters, then squeeze repeated characters from the result.
3390 The @var{set1} and (if given) @var{set2} arguments define ordered
3391 sets of characters, referred to below as @var{set1} and @var{set2}. These
3392 sets are the characters of the input that @code{tr} operates on.
3393 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
3394 complement (all of the characters that are not in @var{set1}).
3397 * Character sets:: Specifying sets of characters.
3398 * Translating:: Changing one character to another.
3399 * Squeezing:: Squeezing repeats and deleting.
3400 * Warnings in tr:: Warning messages.
3404 @node Character sets
3405 @subsection Specifying sets of characters
3407 @cindex specifying sets of characters
3409 The format of the @var{set1} and @var{set2} arguments resembles
3410 the format of regular expressions; however, they are not regular
3411 expressions, only lists of characters. Most characters simply
3412 represent themselves in these strings, but the strings can contain
3413 the shorthands listed below, for convenience. Some of them can be
3414 used only in @var{set1} or @var{set2}, as noted below.
3418 @item Backslash escapes
3419 @cindex backslash escapes
3421 A backslash followed by a character not listed below causes an error
3440 The character with the value given by @var{ooo}, which is 1 to 3
3449 The notation @samp{@var{m}-@var{n}} expands to all of the characters
3450 from @var{m} through @var{n}, in ascending order. @var{m} should
3451 collate before @var{n}; if it doesn't, an error results. As an example,
3452 @samp{0-9} is the same as @samp{0123456789}.
3454 @sc{gnu} @code{tr} does not support the System V syntax that uses square
3455 brackets to enclose ranges. Translations specified in that format
3456 sometimes work as expected, since the brackets are often transliterated
3457 to themselves. However, they should be avoided because they sometimes
3458 behave unexpectedly. For example, @samp{tr -d '[0-9]'} deletes brackets
3461 Many historically common and even accepted uses of ranges are not
3462 portable. For example, on @sc{ebcdic} hosts using the @samp{A-Z}
3463 range will not do what most would expect because @samp{A} through @samp{Z}
3464 are not contiguous as they are in @sc{ascii}.
3465 If you can rely on a @sc{posix} compliant version of @code{tr}, then
3466 the best way to work around this is to use character classes (see below).
3467 Otherwise, it is most portable (and most ugly) to enumerate the members
3470 @item Repeated characters
3471 @cindex repeated characters
3473 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
3474 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
3475 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
3476 to as many copies of @var{c} as are needed to make @var{set2} as long as
3477 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
3478 octal, otherwise in decimal.
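A small sketch of the repeat notation, assuming @sc{gnu} @code{tr}; the
sample strings are made up.

```shell
# [x*3] expands to xxx, so a, b and c all map to x.
fixed=$(printf 'aabbcc\n' | tr 'abc' '[x*3]')

# [#*] pads set2 to the length of set1: every digit becomes #.
masked=$(printf 'abc 123\n' | tr '[:digit:]' '[#*]')

# A count with a leading 0 is octal: 010 means eight copies.
octal=$(printf 'abcdefgh\n' | tr 'abcdefgh' '[y*010]')

printf '%s\n' "$fixed" "$masked" "$octal"
```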
3480 @item Character classes
3481 @cindex character classes
3483 The notation @samp{[:@var{class}:]} expands to all of the characters in
3484 the (predefined) class @var{class}. The characters expand in no
3485 particular order, except for the @code{upper} and @code{lower} classes,
3486 which expand in ascending order. When the @samp{--delete} (@samp{-d})
3487 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
3488 character class can be used in @var{set2}. Otherwise, only the
3489 character classes @code{lower} and @code{upper} are accepted in
3490 @var{set2}, and then only if the corresponding character class
3491 (@code{upper} and @code{lower}, respectively) is specified in the same
3492 relative position in @var{set1}. Doing this specifies case conversion.
3493 The class names are given below; an error results when an invalid class
3505 Horizontal whitespace.
3514 Printable characters, not including space.
3520 Printable characters, including space.
3523 Punctuation characters.
3526 Horizontal or vertical whitespace.
3535 @item Equivalence classes
3536 @cindex equivalence classes
3538 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
3539 equivalent to @var{c}, in no particular order. Equivalence classes are
3540 a relatively recent invention intended to support non-English alphabets.
3541 But there seems to be no standard way to define them or determine their
3542 contents. Therefore, they are not fully implemented in @sc{gnu} @code{tr};
3543 each character's equivalence class consists only of that character,
3544 which is of no particular use.
3550 @subsection Translating
3552 @cindex translating characters
3554 @code{tr} performs translation when @var{set1} and @var{set2} are
3555 both given and the @samp{--delete} (@samp{-d}) option is not given.
3556 @code{tr} translates each character of its input that is in @var{set1}
3557 to the corresponding character in @var{set2}. Characters not in
3558 @var{set1} are passed through unchanged. When a character appears more
3559 than once in @var{set1} and the corresponding characters in @var{set2}
3560 are not all the same, only the final one is used. For example, these
3561 two commands are equivalent:
3568 A common use of @code{tr} is to convert lowercase characters to
3569 uppercase. This can be done in many ways. Here are three of them:
3572 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
3574 tr '[:lower:]' '[:upper:]'
3578 But note that using ranges like @code{a-z} above is not portable.
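Both points can be checked from the shell. This is a sketch that
assumes @sc{gnu} @code{tr}; the sample words are arbitrary.

```shell
# When a character is repeated in set1, the last mapping wins:
# every "a" becomes "z" in both commands.
dup=$(printf 'banana\n' | tr aaa xyz)
single=$(printf 'banana\n' | tr a z)

# Case conversion with character classes, which avoids ranges.
upper=$(printf 'hello, world\n' | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$dup" "$single" "$upper"
```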
3580 When @code{tr} is performing translation, @var{set1} and @var{set2}
3581 typically have the same length. If @var{set1} is shorter than
3582 @var{set2}, the extra characters at the end of @var{set2} are ignored.
3584 On the other hand, making @var{set1} longer than @var{set2} is not
3585 portable; @sc{posix.2} says that the result is undefined. In this situation,
3586 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
3587 the last character of @var{set2} as many times as necessary. System V
3588 @code{tr} truncates @var{set1} to the length of @var{set2}.
3590 By default, @sc{gnu} @code{tr} handles this case like BSD @code{tr}. When
3591 the @samp{--truncate-set1} (@samp{-t}) option is given, @sc{gnu} @code{tr}
3592 handles this case like the System V @code{tr} instead. This option is
3593 ignored for operations other than translation.
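The difference between the two behaviors is easy to observe. The sketch
below assumes @sc{gnu} @code{tr}; the five-letter input is arbitrary.

```shell
# Default (BSD style): set2 is padded with its last character,
# so c, d and e all map to y.
padded=$(printf 'abcde\n' | tr abcde xy)

# With -t (System V style): set1 is cut down to "ab",
# so c, d and e pass through unchanged.
truncated=$(printf 'abcde\n' | tr -t abcde xy)

printf '%s\n' "$padded" "$truncated"
```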
3595 Acting like System V @code{tr} in this case breaks the relatively common
3599 tr -cs A-Za-z0-9 '\012'
3603 because it converts only zero bytes (the first element in the
3604 complement of @var{set1}), rather than all non-alphanumerics, to
3608 By the way, the above idiom is not portable because it uses ranges.
3609 Assuming a @sc{posix} compliant @code{tr}, here is a better way to write it:
3612 tr -cs '[:alnum:]' '[\n*]'
3617 @subsection Squeezing repeats and deleting
3619 @cindex squeezing repeat characters
3620 @cindex deleting characters
3622 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
3623 removes any input characters that are in @var{set1}.
3625 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
3626 @code{tr} replaces each input sequence of a repeated character that
3627 is in @var{set1} with a single occurrence of that character.
3629 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
3630 first performs any deletions using @var{set1}, then squeezes repeats
3631 from any remaining characters using @var{set2}.
3633 The @samp{--squeeze-repeats} option may also be used when translating,
3634 in which case @code{tr} first performs translation, then squeezes
3635 repeats from any remaining characters using @var{set2}.
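The four combinations described above can be compared side by side.
This is only a sketch, assuming @sc{gnu} @code{tr}, with input strings
invented for the purpose.

```shell
# -d alone: remove the characters in set1.
deleted=$(printf 'a1b22c333\n' | tr -d '[:digit:]')

# -s alone: squeeze runs of characters in set1 (c is left alone).
squeezed=$(printf 'aaabbbccc\n' | tr -s 'ab')

# -d and -s: delete using set1, then squeeze using set2.
both=$(printf 'aa11bb  cc\n' | tr -ds '[:digit:]' '[:lower:]')

# -s while translating: translate, then squeeze using set2.
split=$(printf 'a,,b;;c\n' | tr -s ',;' '\n')

printf '%s\n' "$deleted" "$squeezed" "$both" "$split"
```

In the third command the two spaces survive, because spaces are in
neither set.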
3637 Here are some examples to illustrate various combinations of options:
3642 Remove all zero bytes:
3649 Put all words on lines by themselves. This converts all
3650 non-alphanumeric characters to newlines, then squeezes each string
3651 of repeated newlines into a single newline:
3654 tr -cs '[:alnum:]' '[\n*]'
3658 Convert each sequence of repeated newlines to a single newline:
3665 Find doubled occurrences of words in a document.
3666 For example, people often write ``the the'' with the duplicated words
3667 separated by a newline. The Bourne shell script below works first
3668 by converting each sequence of punctuation and blank characters to a
3669 single newline. That puts each ``word'' on a line by itself.
3670 Next it maps all uppercase characters to lower case, and finally it
3671 runs @code{uniq} with the @samp{-d} option to print out only the words
3672 that were adjacent duplicates.
3677 | tr -s '[:punct:][:blank:]' '\n' \
3678 | tr '[:upper:]' '[:lower:]' \
3683 Deleting a small set of characters is usually straightforward. For example,
3684 to remove all @samp{a}s, @samp{x}s, and @samp{M}s you would do this:
3690 However, when @samp{-} is one of those characters, it can be tricky because
3691 @samp{-} has special meanings. Performing the same task as above but also
3692 removing all @samp{-} characters, we might try @code{tr -d -axM}, but
3693 that would fail because @code{tr} would try to interpret @samp{-a} as
3694 a command-line option. Alternatively, we could try putting the hyphen
3695 inside the string, @code{tr -d a-xM}, but that wouldn't work either because
3696 it would make @code{tr} interpret @code{a-x} as the range of characters
3697 @samp{a}@dots{}@samp{x} rather than the three characters @samp{a}, @samp{-}, and @samp{x}.
3698 One way to solve the problem is to put the hyphen at the end of the list
3705 More generally, use the equivalence class notation @code{[=c=]}
3706 with @samp{-} (or any other character) in place of the @samp{c}:
3712 Note how single quotes are used in the above example to protect the
3713 square brackets from interpretation by a shell.
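Both approaches can be compared on a sample string. This is a sketch
that assumes @sc{gnu} @code{tr}; the input text is invented.

```shell
# Hyphen placed last, where it cannot start an option or a range.
trailing=$(printf 'Max-a-Million\n' | tr -d 'axM-')

# Hyphen written as an equivalence class; its position is then free.
bracketed=$(printf 'Max-a-Million\n' | tr -d '[=-=]axM')

printf '%s\n' "$trailing" "$bracketed"
```

Each command removes every @samp{a}, @samp{x}, @samp{M} and @samp{-},
leaving @samp{illion}.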
3718 @node Warnings in tr
3719 @subsection Warning messages
3721 @vindex POSIXLY_CORRECT
3722 Setting the environment variable @env{POSIXLY_CORRECT} turns off the
3723 following warning and error messages, for strict compliance with
3724 @sc{posix.2}. Otherwise, the following diagnostics are issued:
3729 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
3730 is not, and @var{set2} is given, @sc{gnu} @code{tr} by default prints
3731 a usage message and exits, because @var{set2} would not be used.
3732 The @sc{posix} specification says that @var{set2} must be ignored in
3733 this case. Silently ignoring arguments is a bad idea.
3736 When an ambiguous octal escape is given. For example, @samp{\400}
3737 is actually @samp{\40} followed by the digit @samp{0}, because the
3738 value 400 octal does not fit into a single byte.
3742 @sc{gnu} @code{tr} does not provide complete BSD or System V compatibility.
3743 For example, it is impossible to disable interpretation of the @sc{posix}
3744 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, @sc{gnu}
3745 @code{tr} does not delete zero bytes automatically, unlike traditional
3746 Unix versions, which provide no way to preserve zero bytes.
3749 @node expand invocation
3750 @section @code{expand}: Convert tabs to spaces
3753 @cindex tabs to spaces, converting
3754 @cindex converting tabs to spaces
3756 @code{expand} writes the contents of each given @var{file}, or standard
3757 input if none are given or for a @var{file} of @samp{-}, to standard
3758 output, with tab characters converted to the appropriate number of
3762 expand [@var{option}]@dots{} [@var{file}]@dots{}
3765 By default, @code{expand} converts all tabs to spaces. It preserves
3766 backspace characters in the output; they decrement the column count for
3767 tab calculations. The default action is equivalent to @samp{-8} (set
3768 tabs every 8 columns).
3770 The program accepts the following options. Also see @ref{Common options}.
3774 @item -@var{tab1}[,@var{tab2}]@dots{}
3775 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3776 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3780 @cindex tabstops, setting
3781 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3782 (default is 8). Otherwise, set the tabs at columns @var{tab1},
3783 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
3784 last tabstop given with single spaces. If the tabstops are specified
3785 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3786 blanks as well as by commas.
3792 @cindex initial tabs, converting
3793 Only convert initial tabs (those that precede every character that is
3794 not a space or a tab) on each line to spaces.
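A brief sketch of the options above, assuming @sc{gnu} @code{expand};
the input lines are made up.

```shell
# Default tab stops every 8 columns: "a", 7 spaces, "b".
by_default=$(printf 'a\tb\n' | expand)

# Tab stops every 4 columns: "a", 3 spaces, "b".
every_four=$(printf 'a\tb\n' | expand -t 4)

# -i converts only the leading tab; the later tab is kept.
initial_only=$(printf '\ta\tb\n' | expand -i -t 4)

printf '%s\n' "$by_default" "$every_four" "$initial_only"
```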
3799 @node unexpand invocation
3800 @section @code{unexpand}: Convert spaces to tabs
3804 @code{unexpand} writes the contents of each given @var{file}, or
3805 standard input if none are given or for a @var{file} of @samp{-}, to
3806 standard output, with strings of two or more space or tab characters
3807 converted to as many tabs as possible followed by as many spaces as are
3811 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
3814 By default, @code{unexpand} converts only initial spaces and tabs (those
3815 that precede every character that is not a space or a tab) on each line. It
3816 preserves backspace characters in the output; they decrement the column
3817 count for tab calculations. By default, tabs are set at every 8th
3820 The program accepts the following options. Also see @ref{Common options}.
3824 @item -@var{tab1}[,@var{tab2}]@dots{}
3825 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3826 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3830 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3831 instead of the default 8. Otherwise, set the tabs at columns
3832 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
3833 tabs beyond the tabstops given unchanged. If the tabstops are specified
3834 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3835 blanks as well as by commas. This option implies the @samp{-a} option.
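The effect of these settings can be sketched as follows, assuming
@sc{gnu} @code{unexpand}. The input lines are invented; the first holds
eight spaces, @samp{x}, seven spaces, and @samp{y}.

```shell
# Default: only the leading run of 8 spaces becomes a tab.
lead=$(printf '        x       y\n' | unexpand)

# -a: the 7 spaces after "x" end at column 16 and become a tab too.
all=$(printf '        x       y\n' | unexpand -a)

# -t 4 sets stops every 4 columns and implies -a.
four=$(printf '    x   y\n' | unexpand -t 4)

printf '%s\n' "$lead" "$all" "$four"
```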
3841 Convert all strings of two or more spaces or tabs, not just initial
3848 @node Opening the software toolbox
3849 @chapter Opening the software toolbox
3851 This chapter originally appeared in @cite{Linux Journal}, volume 1,
3852 number 2, in the @cite{What's GNU?} column. It was written by Arnold
3856 * Toolbox introduction:: Toolbox introduction
3857 * I/O redirection:: I/O redirection
3858 * The who command:: The @code{who} command
3859 * The cut command:: The @code{cut} command
3860 * The sort command:: The @code{sort} command
3861 * The uniq command:: The @code{uniq} command
3862 * Putting the tools together:: Putting the tools together
3866 @node Toolbox introduction
3867 @unnumberedsec Toolbox introduction
3869 This month's column is only peripherally related to the @sc{gnu} Project, in
3870 that it describes a number of the @sc{gnu} tools on your Linux system and how
3871 they might be used. What it's really about is the ``Software Tools'' philosophy
3872 of program development and usage.
3874 The software tools philosophy was an important and integral concept
3875 in the initial design and development of Unix (of which Linux and @sc{gnu} are
essentially clones). Unfortunately, in the modern-day press of
3877 Internetworking and flashy GUIs, it seems to have fallen by the
3878 wayside. This is a shame, since it provides a powerful mental model
3879 for solving many kinds of problems.
3881 Many people carry a Swiss Army knife around in their pants pockets (or
3882 purse). A Swiss Army knife is a handy tool to have: it has several knife
3883 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
3884 a number of other things on it. For the everyday, small miscellaneous jobs
3885 where you need a simple, general purpose tool, it's just the thing.
3887 On the other hand, an experienced carpenter doesn't build a house using
3888 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
3889 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
3890 exactly when and where to use each tool; you won't catch him hammering nails
3891 with the handle of his screwdriver.
3893 The Unix developers at Bell Labs were all professional programmers and trained
3894 computer scientists. They had found that while a one-size-fits-all program
3895 might appeal to a user because there's only one program to use, in practice
3903 difficult to maintain and
3907 difficult to extend to meet new situations.
3910 Instead, they felt that programs should be specialized tools. In short, each
3911 program ``should do one thing well.'' No more and no less. Such programs are
3912 simpler to design, write, and get right---they only do one thing.
Furthermore, they found that with the right machinery for hooking programs
together, the whole was greater than the sum of the parts. By combining
several special purpose programs, you could accomplish a specific task
that none of the programs was designed for, and accomplish it much more
quickly and easily than if you had to write a special purpose program.
We will see some (classic) examples of this further on in the column.
(An important additional point was that, if necessary, you should take a
detour and build any software tools you may need first, if you don't
already have something appropriate in the toolbox.)
3924 @node I/O redirection
3925 @unnumberedsec I/O redirection
3927 Hopefully, you are familiar with the basics of I/O redirection in the
3928 shell, in particular the concepts of ``standard input,'' ``standard output,''
3929 and ``standard error''. Briefly, ``standard input'' is a data source, where
3930 data comes from. A program should not need to either know or care if the
3931 data source is a disk file, a keyboard, a magnetic tape, or even a punched
3932 card reader. Similarly, ``standard output'' is a data sink, where data goes
3933 to. The program should neither know nor care where this might be.
3934 Programs that only read their standard input, do something to the data,
3935 and then send it on, are called ``filters'', by analogy to filters in a
3938 With the Unix shell, it's very easy to set up data pipelines:
3941 program_to_create_data | filter1 | .... | filterN > final.pretty.data
3944 We start out by creating the raw data; each filter applies some successive
3945 transformation to the data, until by the time it comes out of the pipeline,
3946 it is in the desired form.
3948 This is fine and good for standard input and standard output. Where does the
standard error come into play? Well, think about @code{filter1} in
3950 the pipeline above. What happens if it encounters an error in the data it
3951 sees? If it writes an error message to standard output, it will just
3952 disappear down the pipeline into @code{filter2}'s input, and the
3953 user will probably never see it. So programs need a place where they can send
3954 error messages so that the user will notice them. This is standard error,
3955 and it is usually connected to your console or window, even if you have
3956 redirected standard output of your program away from your screen.
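The separation can be seen in a minimal sketch. The @code{filter1} string
below is a hypothetical stand-in for a real filter, and the temporary file
exists only so that the example can inspect what went to standard error:

```shell
# Hypothetical stand-in for filter1: one line of data on standard
# output, one complaint on standard error.
filter1='echo "good data"; echo "filter1: bad record" >&2'

# Capture standard error in a temporary file so we can look at it.
errlog=$(mktemp)

# Only standard output flows down the pipe into tr.
piped=$(sh -c "$filter1" 2> "$errlog" | tr '[:lower:]' '[:upper:]')
errors=$(cat "$errlog")
rm -f "$errlog"

echo "$piped"     # GOOD DATA
echo "$errors"    # filter1: bad record
```

The error message never reaches @code{tr}, which is why it is still in
lower case.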
3958 For filter programs to work together, the format of the data has to be
3959 agreed upon. The most straightforward and easiest format to use is simply
3960 lines of text. Unix data files are generally just streams of bytes, with
3961 lines delimited by the @sc{ascii} @sc{lf} (Line Feed) character,
3962 conventionally called a ``newline'' in the Unix literature. (This is
3963 @code{'\n'} if you're a C programmer.) This is the format used by all
3964 the traditional filtering programs. (Many earlier operating systems
3965 had elaborate facilities and special purpose programs for managing
3966 binary data. Unix has always shied away from such things, under the
3967 philosophy that it's easiest to simply be able to view and edit your
3968 data with a text editor.)
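As a small illustration (the sample words are made up), the newline is an
ordinary byte that any filter can count or translate:

```shell
# Three lines of text are just bytes with a newline after each line.
lines=$(printf 'one\ntwo\nthree\n' | wc -l)

# tr can rewrite the newline (octal 012) like any other character.
joined=$(printf 'one\ntwo\nthree\n' | tr '\012' ',')

echo "$joined"    # one,two,three,
```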
3970 OK, enough introduction. Let's take a look at some of the tools, and then
3971 we'll see how to hook them together in interesting ways. In the following
3972 discussion, we will only present those command line options that interest
3973 us. As you should always do, double check your system documentation
3976 @node The who command
3977 @unnumberedsec The @code{who} command
3979 The first program is the @code{who} command. By itself, it generates a
3980 list of the users who are currently logged in. Although I'm writing
3981 this on a single-user system, we'll pretend that several people are
3986 arnold console Jan 22 19:57
3987 miriam ttyp0 Jan 23 14:19(:0.0)
3988 bill ttyp1 Jan 21 09:32(:0.0)
3989 arnold ttyp2 Jan 23 20:48(:0.0)
3992 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
3993 There are three people logged in, and I am logged in twice. On traditional
3994 Unix systems, user names are never more than eight characters long. This
3995 little bit of trivia will be useful later. The output of @code{who} is nice,
3996 but the data is not all that exciting.
3998 @node The cut command
3999 @unnumberedsec The @code{cut} command
4001 The next program we'll look at is the @code{cut} command. This program
4002 cuts out columns or fields of input data. For example, we can tell it
to print just the login name and full name from the @file{/etc/passwd}
file. The @file{/etc/passwd} file has seven fields, separated by
4008 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
To get the first and fifth fields, we would use @code{cut} like this:
4014 $ cut -d: -f1,5 /etc/passwd
4017 arnold:Arnold D. Robbins
4018 miriam:Miriam A. Robbins
4022 With the @samp{-c} option, @code{cut} will cut out specific characters
4023 (i.e., columns) in the input lines. This command looks like it might be
4024 useful for data filtering.
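Here is a quick sketch of @samp{-c}, using a made-up line laid out like
the @code{who} output shown earlier. The exact column spacing is an
assumption, not something @code{who} guarantees:

```shell
line='arnold   console  Jan 22 19:57'

# Characters 1 through 8 hold the (space padded) login name.
name=$(printf '%s\n' "$line" | cut -c1-8)

# Characters 10 through 16 hold the terminal name.
tty=$(printf '%s\n' "$line" | cut -c10-16)

echo "[$name]"    # [arnold  ]
echo "[$tty]"     # [console]
```

The brackets are there only to make the trailing blanks visible.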
4027 @node The sort command
4028 @unnumberedsec The @code{sort} command
4030 Next we'll look at the @code{sort} command. This is one of the most
4031 powerful commands on a Unix-style system; one that you will often find
4032 yourself using when setting up fancy data plumbing. The @code{sort}
4033 command reads and sorts each file named on the command line. It then
4034 merges the sorted data and writes it to standard output. It will read
4035 standard input if no files are given on the command line (thus
4036 making it into a filter). The sort is based on the character collating
4037 sequence or based on user-supplied ordering criteria.
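Both modes can be tried with a few made-up words. This is only a sketch;
the file names @file{a} and @file{b} are arbitrary, and the locale is
forced to @samp{C} so that the ordering is predictable:

```shell
# Use the plain byte ordering so the result does not depend on locale.
LC_ALL=C
export LC_ALL

# With no file names, sort acts as a filter on standard input.
sorted=$(printf 'miriam\narnold\nbill\n' | sort)
echo "$sorted"    # arnold, bill, miriam (one per line)

# With several files, the sorted data is merged into one stream.
dir=$(mktemp -d)
printf 'pear\napple\n' > "$dir/a"
printf 'fig\nbanana\n' > "$dir/b"
merged=$(sort "$dir/a" "$dir/b" | tr '\012' ' ')
rm -r "$dir"
echo "$merged"    # apple banana fig pear
```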
4040 @node The uniq command
4041 @unnumberedsec The @code{uniq} command
4043 Finally (at least for now), we'll look at the @code{uniq} program. When
4044 sorting data, you will often end up with duplicate lines, lines that
4045 are identical. Usually, all you need is one instance of each line.
4046 This is where @code{uniq} comes in. The @code{uniq} program reads its
4047 standard input, which it expects to be sorted. It only prints out one
4048 copy of each duplicated line. It does have several options. Later on,
4049 we'll use the @samp{-c} option, which prints each unique line, preceded
4050 by a count of the number of times that line occurred in the input.
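A minimal sketch with made-up, already sorted names shows both behaviors.
The @code{sed} step is our own addition; it removes the leading padding
that @samp{uniq -c} prints, whose width differs between implementations:

```shell
# uniq compares adjacent lines, so the input is already sorted.
names=$(printf 'arnold\narnold\nbill\nmiriam\n')

unique=$(printf '%s\n' "$names" | uniq)
echo "$unique"     # arnold, bill, miriam (one per line)

# -c prefixes each line with its count; strip the padding with sed.
counted=$(printf '%s\n' "$names" | uniq -c | sed 's/^ *//')
echo "$counted"    # 2 arnold, 1 bill, 1 miriam (one per line)
```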
4053 @node Putting the tools together
4054 @unnumberedsec Putting the tools together
4056 Now, let's suppose this is a large BBS system with dozens of users
4057 logged in. The management wants the SysOp to write a program that will
4058 generate a sorted list of logged in users. Furthermore, even if a user
4059 is logged in multiple times, his or her name should only show up in the
4062 The SysOp could sit down with the system documentation and write a C
4063 program that did this. It would take perhaps a couple of hundred lines
4064 of code and about two hours to write it, test it, and debug it.
4065 However, knowing the software toolbox, the SysOp can instead start out
4066 by generating just a list of logged on users:
4076 Next, sort the list:
4079 $ who | cut -c1-8 | sort
4086 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
4089 $ who | cut -c1-8 | sort | uniq
4095 The @code{sort} command actually has a @samp{-u} option that does what
4096 @code{uniq} does. However, @code{uniq} has other uses for which one
4097 cannot substitute @samp{sort -u}.
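To make the comparison concrete, here is a sketch that replaces the real
@code{who} with made-up output (the column layout is an assumption). It
also shows one of those other uses, the @samp{-d} option of @code{uniq}:

```shell
LC_ALL=C
export LC_ALL

# Made-up who output; the column layout is an assumption.
who_output='arnold   console  Jan 22 19:57
miriam   ttyp0    Jan 23 14:19
bill     ttyp1    Jan 21 09:32
arnold   ttyp2    Jan 23 20:48'

# The two ways of removing duplicates give the same list.
with_uniq=$(printf '%s\n' "$who_output" | cut -c1-8 | sort | uniq)
with_sort_u=$(printf '%s\n' "$who_output" | cut -c1-8 | sort -u)

# Something sort -u cannot do: -d prints only the lines that were
# repeated, here the users logged in more than once.
repeated=$(printf '%s\n' "$who_output" | cut -c1-8 | sort | uniq -d | tr -d ' ')
echo "$repeated"    # arnold
```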
4099 The SysOp puts this pipeline into a shell script, and makes it available for
4100 all the users on the system:
4103 # cat > /usr/local/bin/listusers
4104 who | cut -c1-8 | sort | uniq
4106 # chmod +x /usr/local/bin/listusers
4109 There are four major points to note here. First, with just four
4110 programs, on one command line, the SysOp was able to save about two
hours' worth of work. Furthermore, the shell pipeline is just about as
4112 efficient as the C program would be, and it is much more efficient in
4113 terms of programmer time. People time is much more expensive than
4114 computer time, and in our modern ``there's never enough time to do
4115 everything'' society, saving two hours of programmer time is no mean
4118 Second, it is also important to emphasize that with the
4119 @emph{combination} of the tools, it is possible to do a special
4120 purpose job never imagined by the authors of the individual programs.
4122 Third, it is also valuable to build up your pipeline in stages, as we did here.
4123 This allows you to view the data at each stage in the pipeline, which helps
4124 you acquire the confidence that you are indeed using these tools correctly.
4126 Finally, by bundling the pipeline in a shell script, other users can use
4127 your command, without having to remember the fancy plumbing you set up for
4128 them. In terms of how you run them, shell scripts and compiled programs are
4131 After the previous warm-up exercise, we'll look at two additional, more
4132 complicated pipelines. For them, we need to introduce two more tools.
4134 The first is the @code{tr} command, which stands for ``transliterate.''
4135 The @code{tr} command works on a character-by-character basis, changing
4136 characters. Normally it is used for things like mapping upper case to
4140 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[:upper:]' '[:lower:]'
4141 this example has mixed case!
4144 There are several options of interest:
4148 work on the complement of the listed characters, i.e.,
4149 operations apply to characters not in the given set
4152 delete characters in the first set from the output
4155 squeeze repeated characters in the output into just one character.
4158 We will be using all three options in a moment.
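Before combining them, it may help to see each option on its own. The
input strings in this sketch are made up for illustration:

```shell
# -d deletes the listed characters.
no_punct=$(printf 'Hello, World!\n' | tr -d ',!')
echo "$no_punct"    # Hello World

# -c complements the set: with -d, delete everything EXCEPT
# letters and the newline (octal 012).
letters=$(printf 'Hello, World!\n' | tr -cd '[:alpha:]\012')
echo "$letters"     # HelloWorld

# -s squeezes each run of a listed character down to one.
squeezed=$(printf 'too    many   blanks\n' | tr -s ' ')
echo "$squeezed"    # too many blanks
```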
4160 The other command we'll look at is @code{comm}. The @code{comm}
command takes two sorted files as input, and prints out the
4162 files' lines in three columns. The output columns are the data lines
4163 unique to the first file, the data lines unique to the second file, and
4164 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
4165 @samp{-3} command line options omit the respective columns. (This is
4166 non-intuitive and takes a little getting used to.) For example:
4188 The single dash as a filename tells @code{comm} to read standard input
4189 instead of a regular file.
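The column-suppressing options are easier to absorb with a small
experiment. In this sketch the two word lists and the file names
@file{first} and @file{second} are made up; both lists are already sorted:

```shell
LC_ALL=C
export LC_ALL

dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/first"
printf 'banana\ncherry\ndate\n'  > "$dir/second"

# Suppress columns 2 and 3: what is left is unique to the first file.
only_first=$(comm -23 "$dir/first" "$dir/second")
echo "$only_first"     # apple

# Suppress columns 1 and 2: what is left is common to both files.
in_both=$(comm -12 "$dir/first" "$dir/second")
echo "$in_both"        # banana, cherry (one per line)

# A dash makes comm read that input from standard input.
only_second=$(printf 'banana\ncherry\ndate\n' | comm -13 "$dir/first" -)
echo "$only_second"    # date

rm -r "$dir"
```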
4191 Now we're ready to build a fancy pipeline. The first application is a word
4192 frequency counter. This helps an author determine if he or she is over-using
4195 The first step is to change the case of all the letters in our input file
to one case. ``The'' and ``the'' are the same word for counting purposes.
4199 $ tr '[:upper:]' '[:lower:]' < whats.gnu | ...
4202 The next step is to get rid of punctuation. Quoted words and unquoted words
4203 should be treated identically; it's easiest to just get the punctuation out of
4207 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' | ...
The second @code{tr} command operates on the complement of the listed
characters, which are all the letters, the digits, the underscore, and
the blank; that is, it deletes every character @emph{not} in that list.
The @samp{\012} represents the newline character; it has to
be left alone. (The @sc{ascii} tab character should also be included for
good measure in a production script.)
4216 At this point, we have data consisting of words separated by blank space.
4217 The words only contain alphanumeric characters (and the underscore). The
next step is to break the data apart so that we have one word per line. This
4219 makes the counting operation much easier, as we will see shortly.
4222 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4223 > tr -s ' ' '\012' | ...
4226 This command turns blanks into newlines. The @samp{-s} option squeezes
4227 multiple newline characters in the output into just one. This helps us
4228 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
4229 This is what the shell prints when it notices you haven't finished
4230 typing in all of a command.)
4232 We now have data consisting of one word per line, no punctuation, all one
4233 case. We're ready to count each word:
4236 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4237 > tr -s ' ' '\012' | sort | uniq -c | ...
4240 At this point, the data might look something like this:
4253 The output is sorted by word, not by count! What we want is the most
4254 frequently used words first. Fortunately, this is easy to accomplish,
4255 with the help of two more @code{sort} options:
4259 do a numeric sort, not a textual one
4262 reverse the order of the sort
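The difference between a textual and a numeric sort shows up as soon as
the numbers have different lengths. A short sketch with made-up counts:

```shell
# Textual sort compares character by character, so 10 sorts before 9.
textual=$(printf '9\n10\n100\n' | LC_ALL=C sort | tr '\012' ' ')
echo "$textual"          # 10 100 9

# Numeric sort compares the values.
numeric=$(printf '9\n10\n100\n' | sort -n | tr '\012' ' ')
echo "$numeric"          # 9 10 100

# Numeric and reversed: largest value first.
largest_first=$(printf '9\n10\n100\n' | sort -nr | tr '\012' ' ')
echo "$largest_first"    # 100 10 9
```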
4265 The final pipeline looks like this:
4268 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4269 > tr -s ' ' '\012' | sort | uniq -c | sort -nr
4278 Whew! That's a lot to digest. Yet, the same principles apply. With six
4279 commands, on two lines (really one long one split for convenience), we've
4280 created a program that does something interesting and useful, in much
4281 less time than we could have written a C program to do the same thing.
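To try the pipeline without a copy of @file{whats.gnu}, here is a sketch
that feeds it a made-up sentence instead. The trailing @code{head} and
@code{sed} steps are our additions; they keep only the most frequent word
and strip the padding in front of the count:

```shell
LC_ALL=C
export LC_ALL

# A made-up sentence standing in for the whats.gnu file.
text='The cat and the dog. The end!'

top=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]_ \012' |
  tr -s ' ' '\012' | sort | uniq -c | sort -nr |
  head -n 1 | sed 's/^ *//')

echo "$top"    # 3 the
```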
4283 A minor modification to the above pipeline can give us a simple spelling
4284 checker! To determine if you've spelled a word correctly, all you have to
4285 do is look it up in a dictionary. If it is not there, then chances are
4286 that your spelling is incorrect. So, we need a dictionary. If you
4287 have the Slackware Linux distribution, you have the file
4288 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
4291 Now, how to compare our file with the dictionary? As before, we generate
4292 a sorted list of words, one per line:
4295 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4296 > tr -s ' ' '\012' | sort -u | ...
4299 Now, all we need is a list of words that are @emph{not} in the
4300 dictionary. Here is where the @code{comm} command comes in.
4303 $ tr '[:upper:]' '[:lower:]' < whats.gnu | tr -cd '[:alnum:]_ \012' |
4304 > tr -s ' ' '\012' | sort -u |
4305 > comm -23 - /usr/lib/ispell/ispell.words
4308 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
4309 dictionary (the second file), and lines that are in both files. Lines
4310 only in the first file (standard input, our stream of words), are
4311 words that are not in the dictionary. These are likely candidates for
4312 spelling errors. This pipeline was the first cut at a production
4313 spelling checker on Unix.
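The same pipeline can be tried on a small scale. In this sketch the
seven-word dictionary and the sample sentence are made up; a real
dictionary file would take the place of the temporary one:

```shell
LC_ALL=C
export LC_ALL

# A tiny, already sorted word list standing in for the real dictionary.
dict=$(mktemp)
printf 'a\ngnu\nis\nit\nnot\nsystem\nunix\n' > "$dict"

text='GNU is not Unix, it is a sytem.'

suspects=$(printf '%s\n' "$text" |
  tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]_ \012' |
  tr -s ' ' '\012' | sort -u |
  comm -23 - "$dict")
rm -f "$dict"

echo "$suspects"    # sytem
```

Both inputs to @code{comm} must be sorted by the same rules, which is why
the sketch forces the @samp{C} locale for every command.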
4315 There are some other tools that deserve brief mention.
4319 search files for text that matches a regular expression
4322 like @code{grep}, but with more powerful regular expressions
4325 count lines, words, characters
4328 a T-fitting for data pipes, copies data to files and to standard output
4331 the stream editor, an advanced tool
4334 a data manipulation language, another advanced tool
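As a brief taste of three of these tools, assuming the usual
@code{grep}, @code{tee}, and @code{wc} programs, here is a sketch with
made-up input in which a copy of the data is saved in mid-pipeline:

```shell
copy=$(mktemp)

# grep keeps the lines matching a regular expression, tee saves a copy
# of what passes through, and wc -l counts the lines.
count=$(printf 'alpha\nbeta\ngamma\ndelta\n' | grep '^[a-c]' | tee "$copy" | wc -l)
saved=$(tr '\012' ' ' < "$copy")
rm -f "$copy"

echo $((count))    # 2
echo "$saved"      # alpha beta
```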
4337 The software tools philosophy also espoused the following bit of
4338 advice: ``Let someone else do the hard part.'' This means, take
4339 something that gives you most of what you need, and then massage it the
4340 rest of the way until it's in the form that you want.
4346 Each program should do one thing well. No more, no less.
4349 Combining programs with appropriate plumbing leads to results where
4350 the whole is greater than the sum of the parts. It also leads to novel
4351 uses of programs that the authors might never have imagined.
4354 Programs should never print extraneous header or trailer data, since these
4355 could get sent on down a pipeline. (A point we didn't mention earlier.)
4358 Let someone else do the hard part.
4361 Know your toolbox! Use each program appropriately. If you don't have an
4362 appropriate tool, build one.
4365 As of this writing, all the programs we've discussed are available via
4366 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
4367 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was current
4368 when this column was written. Check the nearest @sc{gnu} archive for the
4369 current version. The main @sc{gnu} FTP site is now @code{ftp.gnu.org}.}
4371 None of what I have presented in this column is new. The Software Tools
4372 philosophy was first introduced in the book @cite{Software Tools},
4373 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
4374 0-201-03669-X). This book showed how to write and use software
4375 tools. It was written in 1976, using a preprocessor for FORTRAN named
4376 @code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
4377 as it is now; FORTRAN was. The last chapter presented a @code{ratfor}
4378 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
4379 awful lot like C; if you know C, you won't have any problem following
4382 In 1981, the book was updated and made available as @cite{Software
4383 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
4384 remain in print, and are well worth reading if you're a programmer.
4385 They certainly made a major change in how I view programming.
4387 Initially, the programs in both books were available (on 9-track tape)
4388 from Addison-Wesley. Unfortunately, this is no longer the case,
4389 although you might be able to find copies floating around the Internet.
4390 For a number of years, there was an active Software Tools Users Group,
4391 whose members had ported the original @code{ratfor} programs to essentially
4392 every computer system with a FORTRAN compiler. The popularity of the
4393 group waned in the middle '80s as Unix began to spread beyond universities.
4395 With the current proliferation of @sc{gnu} code and other clones of Unix
4396 programs, these programs now receive little attention; modern C versions are
4397 much more efficient and do more than these programs do. Nevertheless, as
4398 exposition of good programming style, and evangelism for a still-valuable
4399 philosophy, these books are unparalleled, and I recommend them highly.
4401 Acknowledgment: I would like to express my gratitude to Brian Kernighan
4402 of Bell Labs, the original Software Toolsmith, for reviewing this column.
4414 @c texinfo-column-for-description: 32