3 @setfilename textutils.info
4 @settitle GNU text utilities
12 @c Put everything in one index (arbitrarily chosen to be the concept index).
20 @set Francois Franc,ois
23 @set Francois Fran\noexpand\ptexc cois
29 * Text utilities: (textutils). GNU text utilities.
30 * cat: (textutils)cat invocation. Concatenate and write files.
31 * cksum: (textutils)cksum invocation. Print @sc{POSIX} CRC checksum.
32 * comm: (textutils)comm invocation. Compare sorted files by line.
33 * csplit: (textutils)csplit invocation. Split by context.
34 * cut: (textutils)cut invocation. Print selected parts of lines.
35 * expand: (textutils)expand invocation. Convert tabs to spaces.
36 * fmt: (textutils)fmt invocation. Reformat paragraph text.
37 * fold: (textutils)fold invocation. Wrap long input lines.
38 * head: (textutils)head invocation. Output the first part of files.
39 * join: (textutils)join invocation. Join lines on a common field.
40 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
41 * nl: (textutils)nl invocation. Number lines and write files.
42 * od: (textutils)od invocation. Dump files in octal, etc.
43 * paste: (textutils)paste invocation. Merge lines of files.
44 * pr: (textutils)pr invocation. Paginate or columnate files.
45 * sort: (textutils)sort invocation. Sort text files.
46 * split: (textutils)split invocation. Split into fixed-size pieces.
47 * sum: (textutils)sum invocation. Print traditional checksum.
48 * tac: (textutils)tac invocation. Reverse files.
49 * tail: (textutils)tail invocation. Output the last part of files.
50 * tr: (textutils)tr invocation. Translate characters.
51 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
52 * uniq: (textutils)uniq invocation. Uniqify files.
53 * wc: (textutils)wc invocation. Byte, word, and line counts.
59 This file documents the GNU text utilities.
61 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
63 Permission is granted to make and distribute verbatim copies of
64 this manual provided the copyright notice and this permission notice
65 are preserved on all copies.
68 Permission is granted to process this file through TeX and print the
69 results, provided the printed document carries copying permission
70 notice identical to this one except for the removal of this paragraph
71 (this paragraph not being relevant to the printed manual).
74 Permission is granted to copy and distribute modified versions of this
75 manual under the conditions for verbatim copying, provided that the entire
76 resulting derived work is distributed under the terms of a permission
77 notice identical to this one.
79 Permission is granted to copy and distribute translations of this manual
80 into another language, under the above conditions for modified versions,
81 except that this permission notice may be stated in a translation approved
86 @title GNU @code{textutils}
87 @subtitle A set of text utilities
88 @subtitle for version @value{VERSION}, @value{UPDATED}
89 @author David MacKenzie et al.
92 @vskip 0pt plus 1filll
93 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
95 Permission is granted to make and distribute verbatim copies of
96 this manual provided the copyright notice and this permission notice
97 are preserved on all copies.
99 Permission is granted to copy and distribute modified versions of this
100 manual under the conditions for verbatim copying, provided that the entire
101 resulting derived work is distributed under the terms of a permission
102 notice identical to this one.
104 Permission is granted to copy and distribute translations of this manual
105 into another language, under the above conditions for modified versions,
106 except that this permission notice may be stated in a translation approved
113 @top GNU text utilities
115 @cindex text utilities
116 @cindex utilities for text handling
118 This manual minimally documents version @value{VERSION} of the GNU text
122 * Introduction:: Caveats, overview, and authors.
123 * Common options:: Common options.
124 * Output of entire files:: cat tac nl od
125 * Formatting file contents:: fmt pr fold
126 * Output of parts of files:: head tail split csplit
127 * Summarizing files:: wc sum cksum md5sum
128 * Operating on sorted files:: sort uniq comm
129 * Operating on fields within a line:: cut paste join
130 * Operating on characters:: tr expand unexpand
131 * Opening the software toolbox:: The software tools philosophy.
132 * Index:: General index.
138 @chapter Introduction
142 This manual is incomplete: No attempt is made to explain basic concepts
143 in a way suitable for novices. Thus, if you are interested, please get
144 involved in improving this manual. The entire GNU community will
148 The GNU text utilities are mostly compatible with the @sc{POSIX.2} standard.
150 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
151 @c sh-utils.texi too -- so be sure to keep them consistent.
152 @cindex bugs, reporting
153 Please report bugs to @samp{bug-gnu-utils@@prep.ai.mit.edu}. Remember
154 to include the version number, machine architecture, input files, and
155 any other information needed to reproduce the bug: your input, what you
156 expected, what you got, and why it is wrong. Diffs are welcome, but
157 please include a description of the problem as well, since this is
158 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
160 This manual is based on the Unix man pages in the distribution, which
161 were originally written by David MacKenzie and updated by Jim Meyering.
162 The original @code{fmt} man page was written by Ross Paterson.
163 @value{Francois} Pinard did the initial conversion to Texinfo format.
164 Karl Berry did the indexing, some reorganization, and editing of the results.
165 Richard Stallman contributed his usual invaluable insights to the
170 @chapter Common options
172 @cindex common options
174 Certain options are available in all these programs. Rather than
175 repeating identical descriptions for each program, this manual
176 describes them here. (In fact, every GNU program accepts (or should accept)
179 A few of these programs take arbitrary strings as arguments. In those
180 cases, @samp{--help} and @samp{--version} are taken as these options
181 only if there is exactly one command line argument.
188 Print a usage message listing all available options, then exit successfully.
192 @cindex version number, finding
193 Print the version number, then exit successfully.
198 @node Output of entire files
199 @chapter Output of entire files
201 @cindex output of entire files
202 @cindex entire files, output of
204 These commands read and write entire files, possibly transforming them
208 * cat invocation:: Concatenate and write files.
209 * tac invocation:: Concatenate and write files in reverse.
210 * nl invocation:: Number lines and write files.
211 * od invocation:: Write files in octal or other formats.
215 @section @code{cat}: Concatenate and write files
218 @cindex concatenate and write files
219 @cindex copying files
221 @code{cat} copies each @var{file} (@samp{-} means standard input), or
222 standard input if none are given, to standard output. Synopsis:
225 cat [@var{option}]@dots{} [@var{file}]@dots{}
228 The program accepts the following options. Also see @ref{Common options}.
236 Equivalent to @samp{-vET}.
239 @itemx --number-nonblank
241 @opindex --number-nonblank
242 Number all nonblank output lines, starting with 1.
246 Equivalent to @samp{-vE}.
252 Display a @samp{$} after the end of each line.
258 Number all output lines, starting with 1.
261 @itemx --squeeze-blank
263 @opindex --squeeze-blank
264 @cindex squeezing blank lines
265 Replace multiple adjacent blank lines with a single blank line.
269 Equivalent to @samp{-vT}.
275 Display @key{TAB} characters as @samp{^I}.
279 Ignored; for Unix compatibility.
282 @itemx --show-nonprinting
284 @opindex --show-nonprinting
285 Display control characters except for @key{LFD} and @key{TAB} using
286 @samp{^} notation and precede characters that have the high bit set
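As a rough illustration of the display options above, the following sketch
pipes made-up sample text through @code{cat}. It assumes a GNU @code{cat}
and a POSIX shell; the variable names are hypothetical.

```shell
# Show tabs as ^I and mark each line end with $ (-vET).
marked=$(printf 'a\tb\n' | cat -vET)
printf '%s\n' "$marked"    # a^Ib$

# Collapse a run of blank lines into a single blank line (-s).
squeezed=$(printf 'x\n\n\n\ny\n' | cat -s)
printf '%s\n' "$squeezed"
```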
293 @section @code{tac}: Concatenate and write files in reverse
296 @cindex reversing files
298 @code{tac} copies each @var{file} (@samp{-} means standard input), or
299 standard input if none are given, to standard output, reversing the
300 records (lines by default) in each separately. Synopsis:
303 tac [@var{option}]@dots{} [@var{file}]@dots{}
306 @dfn{Records} are separated by instances of a string (newline by
307 default). By default, this separator string is attached to the end of
308 the record that it follows in the file.
310 The program accepts the following options. Also see @ref{Common options}.
318 The separator is attached to the beginning of the record that it
319 precedes in the file.
325 Treat the separator string as a regular expression.
327 @item -s @var{separator}
328 @itemx --separator=@var{separator}
331 Use @var{separator} as the record separator, instead of newline.
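The record handling described above can be sketched as follows. This is
only an illustration with made-up input, assuming GNU @code{tac} and a
POSIX shell; the variable names are hypothetical.

```shell
# Reverse the order of lines (the default record is a line).
reversed=$(printf 'one\ntwo\nthree\n' | tac)
printf '%s\n' "$reversed"

# Use a comma as the record separator; it stays attached to the
# end of the record that it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)
printf '%s\n' "$by_comma"    # c,b,a,
```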
337 @section @code{nl}: Number lines and write files
340 @cindex numbering lines
341 @cindex line numbering
343 @code{nl} writes each @var{file} (@samp{-} means standard input), or
344 standard input if none are given, to standard output, with line numbers
345 added to some or all of the lines. Synopsis:
348 nl [@var{option}]@dots{} [@var{file}]@dots{}
351 @cindex logical pages, numbering on
352 @code{nl} decomposes its input into (logical) pages; by default, the
353 line number is reset to 1 at the top of each logical page. @code{nl}
354 treats all of the input files as a single document; it does not reset
355 line numbers or logical pages between files.
357 @cindex headers, numbering
358 @cindex body, numbering
359 @cindex footers, numbering
360 A logical page consists of three sections: header, body, and footer.
361 Any of the sections can be empty. Each can be numbered in a different
362 style from the others.
364 The beginnings of the sections of logical pages are indicated in the
365 input file by a line containing exactly one of these delimiter strings:
376 The two characters from which these strings are made can be changed from
377 @samp{\} and @samp{:} via options (see below), but the pattern and
378 length of each string cannot be changed.
380 A section delimiter is replaced by an empty line on output. Any text
381 that comes before the first section delimiter string in the input file
382 is considered to be part of a body section, so @code{nl} treats a
383 file that contains no section delimiters as a single body section.
385 The program accepts the following options. Also see @ref{Common options}.
390 @itemx --body-numbering=@var{style}
392 @opindex --body-numbering
393 Select the numbering style for lines in the body section of each
394 logical page. When a line is not numbered, the current line number
395 is not incremented, but the line number separator character is still
396 prepended to the line. The styles are:
402 number only nonempty lines (default for body),
404 do not number lines (default for header and footer),
406 number only lines that contain a match for @var{regexp}.
410 @itemx --section-delimiter=@var{cd}
412 @opindex --section-delimiter
413 @cindex section delimiters of pages
414 Set the section delimiter characters to @var{cd}; default is
415 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
416 (Remember to protect @samp{\} or other metacharacters from shell
417 expansion with quotes or extra backslashes.)
420 @itemx --footer-numbering=@var{style}
422 @opindex --footer-numbering
423 Analogous to @samp{--body-numbering}.
426 @itemx --header-numbering=@var{style}
428 @opindex --header-numbering
429 Analogous to @samp{--body-numbering}.
431 @item -i @var{number}
432 @itemx --page-increment=@var{number}
434 @opindex --page-increment
435 Increment line numbers by @var{number} (default 1).
437 @item -l @var{number}
438 @itemx --join-blank-lines=@var{number}
440 @opindex --join-blank-lines
441 @cindex empty lines, numbering
442 @cindex blank lines, numbering
443 Consider @var{number} (default 1) consecutive empty lines to be one
444 logical line for numbering, and only number the last one. Where fewer
445 than @var{number} consecutive empty lines occur, do not number them.
446 An empty line is one that contains no characters, not even spaces
449 @item -n @var{format}
450 @itemx --number-format=@var{format}
452 @opindex --number-format
453 Select the line numbering format (default is @code{rn}):
457 @opindex ln @r{format for @code{nl}}
458 left justified, no leading zeros;
460 @opindex rn @r{format for @code{nl}}
461 right justified, no leading zeros;
463 @opindex rz @r{format for @code{nl}}
464 right justified, leading zeros.
470 @opindex --no-renumber
471 Do not reset the line number at the start of a logical page.
473 @item -s @var{string}
474 @itemx --number-separator=@var{string}
476 @opindex --number-separator
477 Separate the line number from the text line in the output with
478 @var{string} (default is @key{TAB}).
480 @item -v @var{number}
481 @itemx --starting-line-number=@var{number}
483 @opindex --starting-line-number
484 Set the initial line number on each logical page to @var{number} (default 1).
486 @item -w @var{number}
487 @itemx --number-width=@var{number}
489 @opindex --number-width
490 Use @var{number} characters for line numbers (default 6).
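A small sketch of how several of these options combine, assuming GNU
@code{nl} and a POSIX shell. The sample input and variable names are
invented for illustration.

```shell
# Number every line (-b a), three columns wide (-w 3), with ": "
# between the number and the text (-s).
numbered=$(printf 'alpha\n\nbeta\n' | nl -b a -w 3 -s ': ')
printf '%s\n' "$numbered"

# Right justified with leading zeros (-n rz).
padded=$(printf 'x\n' | nl -n rz -w 3 -s ' ')
printf '%s\n' "$padded"    # 001 x
```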
496 @section @code{od}: Write files in octal or other formats
499 @cindex octal dump of files
500 @cindex hex dump of files
501 @cindex ASCII dump of files
502 @cindex file contents, dumping unambiguously
504 @code{od} writes an unambiguous representation of each @var{file}
505 (@samp{-} means standard input), or standard input if none are given.
509 od [@var{option}]@dots{} [@var{file}]@dots{}
510 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
513 Each line of output consists of the offset in the input, followed by
514 groups of data from the file. By default, @code{od} prints the offset in
515 octal, and each group of file data is two bytes of input printed as a
518 The program accepts the following options. Also see @ref{Common options}.
523 @itemx --address-radix=@var{radix}
525 @opindex --address-radix
526 @cindex radix for file offsets
527 @cindex file offset radix
528 Select the base in which file offsets are printed. @var{radix} can
529 be one of the following:
539 none (do not print offsets).
542 The default is octal.
545 @itemx --skip-bytes=@var{bytes}
547 @opindex --skip-bytes
548 Skip @var{bytes} input bytes before formatting and writing. If
549 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
550 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
551 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
552 by 1024, and @samp{m} by 1048576.
555 @itemx --read-bytes=@var{bytes}
557 @opindex --read-bytes
558 Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
559 @var{bytes} are interpreted as for the @samp{-j} option.
562 @itemx --strings[=@var{n}]
565 @cindex string constants, outputting
566 Instead of the normal output, output only @dfn{string constants}: at
567 least @var{n} (3 by default) consecutive ASCII graphic characters,
568 followed by a null (zero) byte.
571 @itemx --format=@var{type}
574 Select the format in which to output the file data. @var{type} is a
575 string of one or more of the type indicator characters below. If you
576 include more than one type indicator character in a single @var{type}
577 string, or use this option more than once, @code{od} writes one copy
578 of each output line using each of the data types that you specified,
579 in the order that you specified.
585 ASCII character or backslash escape,
598 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
599 newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
600 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
603 Except for types @samp{a} and @samp{c}, you can specify the number
604 of bytes to use in interpreting each number in the given data type
605 by following the type indicator character with a decimal integer.
606 Alternatively, you can specify the size of one of the C compiler's
607 built-in data types by following the type indicator character with
608 one of the following characters. For integers (@samp{d}, @samp{o},
622 For floating point (@code{f}):
634 @itemx --output-duplicates
636 @opindex --output-duplicates
637 Output consecutive lines that are identical. By default, when two or
638 more consecutive output lines would be identical, @code{od} outputs only
639 the first line, and puts just an asterisk on the following line to
640 indicate the elision.
643 @itemx --width[=@var{n}]
646 Dump @var{n} input bytes per output line. This must be a multiple of
647 the least common multiple of the sizes associated with the specified
648 output types. If @var{n} is omitted, the default is 32. If this option
649 is not given at all, the default is 16.
653 The next several options map the old, pre-@sc{POSIX} format specification
654 options to the corresponding @sc{POSIX} format specs. GNU @code{od} accepts
655 any combination of old- and new-style options. Format specification
662 Output as named characters. Equivalent to @samp{-ta}.
666 Output as octal bytes. Equivalent to @samp{-toC}.
670 Output as ASCII characters or backslash escapes. Equivalent to
675 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
679 Output as floats. Equivalent to @samp{-tfF}.
683 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
687 Output as decimal shorts. Equivalent to @samp{-td2}.
691 Output as decimal longs. Equivalent to @samp{-td4}.
695 Output as octal shorts. Equivalent to @samp{-to2}.
699 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
703 @opindex --traditional
704 Recognize the pre-@sc{POSIX} non-option arguments that traditional @code{od}
705 accepted. The following syntax:
708 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
712 can be used to specify at most one file and optional arguments
713 specifying an offset and a pseudo-start address, @var{label}. By
714 default, @var{offset} is interpreted as an octal number specifying how
715 many input bytes to skip before formatting and writing. The optional
716 trailing decimal point forces the interpretation of @var{offset} as a
717 decimal number. If no decimal point is specified and the offset begins with
718 @samp{0x} or @samp{0X} it is interpreted as a hexadecimal number. If
719 there is a trailing @samp{b}, the number of bytes skipped will be
720 @var{offset} multiplied by 512. The @var{label} argument is interpreted
721 just like @var{offset}, but it specifies an initial pseudo-address. The
722 pseudo-addresses are displayed in parentheses following any normal
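To make the @sc{POSIX}-style options concrete, here is a minimal sketch
that dumps a few made-up bytes. It assumes GNU @code{od} and a POSIX
shell; the variable names are hypothetical, and the exact spacing of the
output may differ between implementations.

```shell
# Dump three bytes as one-byte hexadecimal values, without offsets.
hex=$(printf 'AB\n' | od -A n -t x1)
printf '%s\n' "$hex"

# Skip one byte (-j 1), then read only one byte (-N 1).
middle=$(printf 'ABC' | od -A n -t x1 -j 1 -N 1)
printf '%s\n' "$middle"
```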
728 @node Formatting file contents
729 @chapter Formatting file contents
731 @cindex formatting file contents
733 These commands reformat the contents of files.
736 * fmt invocation:: Reformat paragraph text.
737 * pr invocation:: Paginate or columnate files for printing.
738 * fold invocation:: Wrap input lines to fit in specified width.
743 @section @code{fmt}: Reformat paragraph text
746 @cindex reformatting paragraph text
747 @cindex paragraphs, reformatting
748 @cindex text, reformatting
750 @code{fmt} fills and joins lines to produce output lines of (at most)
751 a given number of characters (75 by default). Synopsis:
754 fmt [@var{option}]@dots{} [@var{file}]@dots{}
757 @code{fmt} reads from the specified @var{file} arguments (or standard
758 input if none are given), and writes to standard output.
760 By default, blank lines, spaces between words, and indentation are
761 preserved in the output; successive input lines with different
762 indentation are not joined; tabs are expanded on input and introduced on
765 @cindex line-breaking
766 @cindex sentences and line-breaking
767 @cindex Knuth, Donald E.
768 @cindex Plass, Michael F.
769 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
770 avoid line breaks after the first word of a sentence or before the last
771 word of a sentence. A @dfn{sentence break} is defined as either the end
772 of a paragraph or a word ending in any of @samp{.?!}, followed by two
773 spaces or end of line, ignoring any intervening parentheses or quotes.
774 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
775 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
776 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
777 and Experience}, 11 (1981), 1119--1184).
779 The program accepts the following options. Also see @ref{Common options}.
784 @itemx --crown-margin
786 @opindex --crown-margin
788 @dfn{Crown margin} mode: preserve the indentation of the first two
789 lines within a paragraph, and align the left margin of each subsequent
790 line with that of the second line.
793 @itemx --tagged-paragraph
795 @opindex --tagged-paragraph
796 @cindex tagged paragraphs
797 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
798 indentation of the first line of a paragraph is the same as the
799 indentation of the second, the first line is treated as a one-line
805 @opindex --split-only
806 Split lines only. Do not join short lines to form longer ones. This
807 prevents sample lines of code and other such ``formatted'' text from
808 being unduly combined.
811 @itemx --uniform-spacing
813 @opindex --uniform-spacing
814 Uniform spacing. Reduce spacing between words to one space, and spacing
815 between sentences to two spaces.
818 @itemx -w @var{width}
819 @itemx --width=@var{width}
820 @opindex -@var{width}
823 Fill output lines up to @var{width} characters (default 75). @code{fmt}
824 initially tries to make lines about 7% shorter than this, to give it
825 room to balance line lengths.
827 @item -p @var{prefix}
828 @itemx --prefix=@var{prefix}
829 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
830 are subject to formatting. The prefix and any preceding whitespace are
831 stripped for the formatting and then re-attached to each formatted output
832 line. One use is to format certain kinds of program comments, while
833 leaving the code unchanged.
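The filling and joining behavior can be tried on a short sample. This is
a sketch under the assumption of GNU @code{fmt} and a POSIX shell; the
input text and variable names are made up.

```shell
# Short lines with the same indentation are joined into one line.
joined=$(printf 'a short\nline and\nanother\n' | fmt)
printf '%s\n' "$joined"    # a short line and another

# Refill a long line so that no output line exceeds 20 characters.
filled=$(printf 'the quick brown fox jumps over the lazy dog\n' | fmt -w 20)
printf '%s\n' "$filled"
```

The exact break points in the second case are chosen by the line-breaking
algorithm, so they are not spelled out here.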
839 @section @code{pr}: Paginate or columnate files for printing
842 @cindex printing, preparing files for
843 @cindex multicolumn output, generating
845 @code{pr} writes each @var{file} (@samp{-} means standard input), or
846 standard input if none are given, to standard output, paginating and
847 optionally outputting in multicolumn format. Synopsis:
850 pr [@var{option}]@dots{} [@var{file}]@dots{}
853 By default, a 5-line header is printed: two blank lines; a line with the
854 date, the file name, and the page count; and two more blank lines. A
855 5-line footer (entirely blank) is also printed.
857 Form feeds in the input cause page breaks in the output.
859 The program accepts the following options. Also see @ref{Common options}.
864 Begin printing with page @var{page}.
867 @opindex -@var{column}
868 Produce @var{column}-column output and print columns down. The column
869 width is automatically decreased as @var{column} increases; unless you
870 use the @samp{-w} option to increase the page width as well, this option
871 might well cause some input to be truncated.
875 @cindex across columns
876 Print columns across rather than down.
880 @cindex balancing columns
881 Balance columns on the last page.
885 Print control characters using hat notation (e.g., @samp{^G}); print
886 other unprintable characters in octal backslash notation. By default,
887 unprintable characters are not changed.
891 @cindex double spacing
892 Double space the output.
894 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
897 Expand tabs to spaces on input. Optional argument @var{in-tabchar} is
898 the input tab character (default is @key{TAB}). Second optional
899 argument @var{in-tabwidth} is the input tab character's width (default
906 Use a formfeed instead of newlines to separate output pages.
908 @item -h @var{header}
910 Replace the file name in the header with the string @var{header}.
912 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
915 Replace spaces with tabs on output. Optional argument @var{out-tabchar}
916 is the output tab character (default is @key{TAB}). Second optional
917 argument @var{out-tabwidth} is the output tab character's width (default
922 Set the page length to @var{n} (default 66) lines. If @var{n} is less
923 than 10, the headers and footers are omitted, as if the @samp{-t} option
928 Print all files in parallel, one in each column.
930 @item -n[@var{number-separator}[@var{digits}]]
932 Precede each column with a line number; with parallel files (@samp{-m}),
933 precede each line with a line number. Optional argument
934 @var{number-separator} is the character to print after each number
935 (default is @key{TAB}). Optional argument @var{digits} is the number of
936 digits per line number (default is 5).
940 @cindex indenting lines
942 Indent each line by a margin @var{n} (default is zero) spaces wide, i.e., set
943 the left margin. The total page width is @var{n} plus the width set
944 with the @samp{-w} option.
948 Do not print a warning message when an argument @var{file} cannot be
949 opened. (The exit status will still be nonzero, however.)
953 Separate columns by the single character @var{c}. If @var{c} is
954 omitted, the default is space; if this option is omitted altogether, the
955 default is @key{TAB}.
959 Do not print the usual 5-line header and the 5-line footer on each page,
960 and do not fill out the bottoms of pages (with blank lines or
965 Print unprintable characters in octal backslash notation.
969 Set the page width to @var{n} (default is 72) columns.
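As an illustration of the column options, the sketch below arranges four
made-up lines in two columns. It assumes GNU @code{pr} and a POSIX shell;
the variable name is hypothetical, and the amount of white space between
columns depends on the page width.

```shell
# Two columns (-2), filled across (-a), with no header or footer (-t).
across=$(printf '1\n2\n3\n4\n' | pr -t -a -2)
printf '%s\n' "$across"
```

With @samp{-a}, the first output line holds the first two input lines
and the second output line holds the remaining two.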
974 @node fold invocation
975 @section @code{fold}: Wrap input lines to fit in specified width
978 @cindex wrapping long input lines
979 @cindex folding long input lines
981 @code{fold} writes each @var{file} (@samp{-} means standard input), or
982 standard input if none are given, to standard output, breaking long
986 fold [@var{option}]@dots{} [@var{file}]@dots{}
989 By default, @code{fold} breaks lines wider than 80 columns. The output
990 is split into as many lines as necessary.
992 @cindex screen columns
993 @code{fold} counts screen columns by default; thus, a tab may count more
994 than one column, backspace decreases the column count, and carriage
995 return sets the column to zero.
997 The program accepts the following options. Also see @ref{Common options}.
1005 Count bytes rather than columns, so that tabs, backspaces, and carriage
1006 returns are each counted as taking up one column, just like other
1013 Break at word boundaries: the line is broken after the last blank before
1014 the maximum line length. If the line contains no such blanks, the line
1015 is broken at the maximum line length as usual.
1017 @item -w @var{width}
1018 @itemx --width=@var{width}
1021 Use a maximum line length of @var{width} columns instead of 80.
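A brief sketch of the width and word-boundary options, assuming GNU
@code{fold} and a POSIX shell. The sample strings and variable names are
invented for illustration.

```shell
# Break a 10-character line into pieces at most 4 columns wide.
folded=$(printf 'abcdefghij\n' | fold -w 4)
printf '%s\n' "$folded"

# With -s, break after the last blank that fits within the width.
words=$(printf 'aa bb cc dd\n' | fold -s -w 6)
printf '%s\n' "$words"
```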
1026 @node Output of parts of files
1027 @chapter Output of parts of files
1029 @cindex output of parts of files
1030 @cindex parts of files, output of
1032 These commands output pieces of the input.
1035 * head invocation:: Output the first part of files.
1036 * tail invocation:: Output the last part of files.
1037 * split invocation:: Split a file into fixed-size pieces.
1038 * csplit invocation:: Split a file into context-determined pieces.
1041 @node head invocation
1042 @section @code{head}: Output the first part of files
1045 @cindex initial part of files, outputting
1046 @cindex first part of files, outputting
1048 @code{head} prints the first part (10 lines by default) of each
1049 @var{file}; it reads from standard input if no files are given or
1050 when given a @var{file} of @samp{-}. Synopses:
1053 head [@var{option}]@dots{} [@var{file}]@dots{}
1054 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1057 If more than one @var{file} is specified, @code{head} prints a
1058 one-line header consisting of
1060 ==> @var{file name} <==
1063 before the output for each @var{file}.
1065 @code{head} accepts two option formats: the new one, in which numbers
1066 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1067 the number precedes any option letters (@samp{-1q}).
1069 The program accepts the following options. Also see @ref{Common options}.
1073 @item -@var{count}@var{options}
1074 @opindex -@var{count}
1075 This option is only recognized if it is specified first. @var{count} is
1076 a decimal number optionally followed by a size letter (@samp{b},
1077 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1078 or other option letters (@samp{cqv}).
1080 @item -c @var{bytes}
1081 @itemx --bytes=@var{bytes}
1084 Print the first @var{bytes} bytes, instead of initial lines. Appending
1085 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1089 @itemx --lines=@var{n}
1092 Output the first @var{n} lines.
1100 Never print file name headers.
1106 Always print file name headers.
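The new-style options can be sketched on made-up input as follows,
assuming GNU @code{head} and a POSIX shell; the variable names are
hypothetical.

```shell
# Keep only the first two lines.
first_two=$(printf '1\n2\n3\n4\n5\n' | head -n 2)
printf '%s\n' "$first_two"

# Keep only the first three bytes.
first_bytes=$(printf 'abcdef\n' | head -c 3)
printf '%s\n' "$first_bytes"    # abc
```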
1111 @node tail invocation
1112 @section @code{tail}: Output the last part of files
1115 @cindex last part of files, outputting
1117 @code{tail} prints the last part (10 lines by default) of each
1118 @var{file}; it reads from standard input if no files are given or
1119 when given a @var{file} of @samp{-}. Synopses:
1122 tail [@var{option}]@dots{} [@var{file}]@dots{}
1123 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1124 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1127 If more than one @var{file} is specified, @code{tail} prints a
1128 one-line header consisting of
1130 ==> @var{file name} <==
1133 before the output for each @var{file}.
1135 @cindex BSD @code{tail}
1136 GNU @code{tail} can output any amount of data (some other versions of
1137 @code{tail} cannot). It also has no @samp{-r} option (print in
1138 reverse), since reversing a file is really a different job from printing
1139 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1140 only reverse files that are at most as large as its buffer, which is
1141 typically 32k. A more reliable and versatile way to reverse files is
1142 the GNU @code{tac} command.
1144 @code{tail} accepts two option formats: the new one, in which numbers
1145 are arguments to the options (@samp{-n 1}), and the old one, in which
1146 the number precedes any option letters (@samp{-1} or @samp{+1}).
1148 If any option-argument is a number @var{n} starting with a @samp{+},
1149 @code{tail} begins printing with the @var{n}th item from the start of
1150 each file, instead of from the end.
1152 The program accepts the following options. Also see @ref{Common options}.
1158 @opindex -@var{count}
1159 @opindex +@var{count}
1160 This option is only recognized if it is specified first. @var{count} is
1161 a decimal number optionally followed by a size letter (@samp{b},
1162 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1163 or other option letters (@samp{cfqv}).
1165 @item -c @var{bytes}
1166 @itemx --bytes=@var{bytes}
1169 Output the last @var{bytes} bytes, instead of final lines. Appending
1170 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1177 @cindex growing files
1178 Loop forever trying to read more characters at the end of the file,
1179 presumably because the file is growing. Ignored if reading from a pipe.
1180 If more than one file is given, @code{tail} prints a header whenever it
1181 gets output from a different file, to indicate which file that output is
1185 @itemx --lines=@var{n}
1188 Output the last @var{n} lines.
1196 Never print file name headers.
1202 Always print file name headers.
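The difference between counting from the end and counting from the start
can be seen in this small sketch, which assumes GNU @code{tail} and a
POSIX shell. The input and variable names are made up.

```shell
# Print the last two lines.
last_two=$(printf '1\n2\n3\n4\n5\n' | tail -n 2)
printf '%s\n' "$last_two"

# A leading + counts from the start: begin printing with line 4.
from_fourth=$(printf '1\n2\n3\n4\n5\n' | tail -n +4)
printf '%s\n' "$from_fourth"
```

For this five-line input both commands select the same two lines.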
1207 @node split invocation
1208 @section @code{split}: Split a file into fixed-size pieces
1211 @cindex splitting a file into pieces
1212 @cindex pieces, splitting a file into
1214 @code{split} creates output files containing consecutive sections of
1215 @var{input} (standard input if none is given or @var{input} is
1216 @samp{-}). Synopsis:
1219 split [@var{option}] [@var{input} [@var{prefix}]]
1222 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1223 left over for the last section) into each output file.
1225 @cindex output file name prefix
1226 The output files' names consist of @var{prefix} (@samp{x} by default)
1227 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1228 that concatenating the output files in sorted order by file name produces
1229 the original input file. (If more than 676 output files are required,
1230 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
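
The naming scheme can be seen in a small sketch. It assumes a scratch directory created with @code{mktemp} and an arbitrary prefix @samp{part.}; neither is required by @code{split}.

```shell
# Work in a scratch directory so the pieces are easy to inspect and remove.
work=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$work/input"

# Two lines per piece; "part." is an arbitrary prefix chosen for this sketch.
(cd "$work" && split -l 2 input part.)

# The shell expands part.* in sorted order.
pieces=$(cd "$work" && echo part.*)
rejoined=$(cd "$work" && cat part.*)
original=$(cat "$work/input")
rm -rf "$work"

echo "$pieces"
```

Five input lines at two lines per piece give three files, and concatenating them in name order reproduces the input.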
1232 The program accepts the following options. Also see @ref{Common options}.
1237 @itemx -l @var{lines}
1238 @itemx --lines=@var{lines}
1241 Put @var{lines} lines of @var{input} into each output file.
1243 @item -b @var{bytes}
1244 @itemx --bytes=@var{bytes}
1247 Put @var{bytes} bytes of @var{input} into each output file.
1248 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1249 @samp{m} by 1048576.
1251 @item -C @var{bytes}
1252 @itemx --line-bytes=@var{bytes}
1254 @opindex --line-bytes
1255 Put into each output file as many complete lines of @var{input} as
1256 possible without exceeding @var{bytes} bytes. For lines longer than
1257 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1258 less than @var{bytes} bytes of the line are left, then continue
1259 normally. @var{bytes} has the same format as for the @samp{--bytes}
1262 @itemx --verbose
1264 Write a diagnostic to standard error just before each output file is opened.
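
To contrast @samp{-b} with @samp{-C}, here is a minimal sketch on a made-up 12-byte input. The prefixes @samp{byte.} and @samp{line.} are arbitrary, and the sketch assumes a @code{split} that supports @samp{-C}.

```shell
work=$(mktemp -d)
printf 'aaa\nbb\ncccc\n' > "$work/input"     # 12 bytes in three lines

# Same size limit, two different ways of choosing the cut points.
(cd "$work" && split -b 8 input byte. && split -C 8 input line.)

byte_first=$(cat "$work/byte.aa")    # cut after exactly 8 bytes, mid-line
line_first=$(cat "$work/line.aa")    # stops at the last complete line that fits
line_second=$(cat "$work/line.ab")
rm -rf "$work"

printf '%s\n' "$line_first"
```

With @samp{-b 8} the first piece ends in the middle of the third line, while @samp{-C 8} ends it after the second line and carries the third line over whole.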
1269 @node csplit invocation
1270 @section @code{csplit}: Split a file into context-determined pieces
1273 @cindex context splitting
1274 @cindex splitting a file into pieces by context
1276 @code{csplit} creates zero or more output files containing sections of
1277 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1280 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1283 The contents of the output files are determined by the @var{pattern}
1284 arguments, as detailed below. An error occurs if a @var{pattern}
1285 argument refers to a nonexistent line of the input file (e.g., if no
1286 remaining line matches a given regular expression). After every
1287 @var{pattern} has been matched, any remaining input is copied into one
1290 By default, @code{csplit} prints the number of bytes written to each
1291 output file after it has been created.
1293 The types of pattern arguments are:
1298 Create an output file containing the input up to but not including line
1299 @var{n} (a positive integer). If followed by a repeat count, also
1300 create an output file containing the next @var{n} lines of the input
1301 file once for each repeat.
1303 @item /@var{regexp}/[@var{offset}]
1304 Create an output file containing the current line up to (but not
1305 including) the next line of the input file that contains a match for
1306 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1307 followed by a positive integer. If it is given, the input up to the
1308 matching line plus or minus @var{offset} is put into the output file,
1309 and the line after that begins the next section of input.
1311 @item %@var{regexp}%[@var{offset}]
1312 Like the previous type, except that it does not create an output
1313 file, so that section of the input file is effectively ignored.
1315 @item @{@var{repeat-count}@}
1316 Repeat the previous pattern @var{repeat-count} additional
1317 times. @var{repeat-count} can either be a positive integer or an
1318 asterisk, meaning repeat as many times as necessary until the input is
1323 The output files' names consist of a prefix (@samp{xx} by default)
1324 followed by a suffix. By default, the suffix is an ascending sequence
1325 of two-digit decimal numbers from @samp{00} up to @samp{99}. In any
1326 case, concatenating the output files in sorted order by filename
1327 produces the original input file.
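
The following sketch splits a made-up input at lines beginning with @samp{# }. The prefix @samp{sec.} and the input text are assumptions for illustration; @samp{-s} only silences the byte counts.

```shell
work=$(mktemp -d)
printf '%s\n' 'intro' '# A' 'a1' '# B' 'b1' > "$work/input"

# The pattern is applied once, then repeated one additional time by {1}.
(cd "$work" && csplit -s -f sec. input '/^# /' '{1}')

files=$(cd "$work" && echo sec.*)
first=$(cat "$work/sec.00")
second=$(cat "$work/sec.01")
third=$(cat "$work/sec.02")
rejoined=$(cat "$work"/sec.*)
original=$(cat "$work/input")
rm -rf "$work"

echo "$files"
```

The text before the first match goes to @file{sec.00}, each matched heading starts a new file, and the input left after the last pattern lands in @file{sec.02}.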
1329 By default, if @code{csplit} encounters an error or receives a hangup,
1330 interrupt, quit, or terminate signal, it removes any output files
1331 that it has created so far before it exits.
1333 The program accepts the following options. Also see @ref{Common options}.
1337 @item -f @var{prefix}
1338 @itemx --prefix=@var{prefix}
1341 @cindex output file name prefix
1342 Use @var{prefix} as the output file name prefix.
1344 @item -b @var{suffix}
1345 @itemx --suffix=@var{suffix}
1348 @cindex output file name suffix
1349 Use @var{suffix} as the output file name suffix. When this option is
1350 specified, the suffix string must include exactly one
1351 @code{printf(3)}-style conversion specification, possibly including
1352 format specification flags, a field width, a precision specification,
1353 or all of these kinds of modifiers. The format letter must convert a
1354 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1355 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1356 entire @var{suffix} is given (with the current output file number) to
1357 @code{sprintf(3)} to form the file name suffixes for each of the
1358 individual output files in turn. If this option is used, the
1359 @samp{--digits} option is ignored.
1361 @item -n @var{digits}
1362 @itemx --digits=@var{digits}
1365 Use output file names containing numbers that are @var{digits} digits
1366 long instead of the default 2.
1371 @opindex --keep-files
1372 Do not remove output files when errors are encountered.
1375 @itemx --elide-empty-files
1377 @opindex --elide-empty-files
1378 Suppress the generation of zero-length output files. (In cases where
1379 the section delimiters of the input file are supposed to mark the first
1380 lines of each of the sections, the first output file will generally be a
1381 zero-length file unless you use this option.) The output file sequence
1382 numbers always run consecutively starting from 0, even when this option
1393 Do not print counts of output file sizes.
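
As a sketch of how @samp{-z} and @samp{-b} combine, the input below starts with a delimiter line, which would otherwise produce an empty first file. The prefix @samp{sec-} and the suffix format @samp{%02d.txt} are arbitrary choices, and the behavior shown is that of GNU @code{csplit}.

```shell
work=$(mktemp -d)
printf '%s\n' '# A' 'a1' '# B' 'b1' > "$work/input"

# Without -z the first piece would be empty, because line 1 already matches.
(cd "$work" && csplit -s -z -f sec- -b '%02d.txt' input '/^# /' '{1}')

files=$(cd "$work" && echo sec-*)
first=$(cat "$work/sec-00.txt")
second=$(cat "$work/sec-01.txt")
rm -rf "$work"

echo "$files"
```

Only two files remain, and their sequence numbers still start at 0.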
1398 @node Summarizing files
1399 @chapter Summarizing files
1401 @cindex summarizing files
1403 These commands generate just a few numbers representing entire
1407 * wc invocation:: Print byte, word, and line counts.
1408 * sum invocation:: Print checksum and block counts.
1409 * cksum invocation:: Print CRC checksum and byte counts.
1410 * md5sum invocation:: Print or check message-digests.
1415 @section @code{wc}: Print byte, word, and line counts
1422 @code{wc} counts the number of bytes, whitespace-separated words, and
1423 newlines in each given @var{file}, or standard input if none are given
1424 or for a @var{file} of @samp{-}. Synopsis:
1427 wc [@var{option}]@dots{} [@var{file}]@dots{}
1430 @cindex total counts
1431 @code{wc} prints one line of counts for each file, and if the file was
1432 given as an argument, it prints the file name following the counts. If
1433 more than one @var{file} is given, @code{wc} prints a final line
1434 containing the cumulative counts, with the file name @file{total}. The
1435 counts are printed in this order: newlines, words, bytes.
1437 By default, @code{wc} prints all three counts. Options can specify
1438 that only certain counts be printed. Options do not undo others
1439 previously given, so
1446 prints both the byte counts and the word counts.
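
A short sketch of the fixed output order, using a made-up two-line input read from standard input so that no file name is printed:

```shell
# Two lines, three words, 14 bytes.
counts=$(printf 'one two\nthree\n' | wc -c -w)

# Split the output on whitespace: words come before bytes,
# whatever order the options were given in.
set -- $counts
words=$1
bytes=$2
echo "words=$words bytes=$bytes"
```

Although @samp{-c} was given first, the word count is still printed before the byte count.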
1448 The program accepts the following options. Also see @ref{Common options}.
1458 Print only the byte counts.
1464 Print only the word counts.
1470 Print only the newline counts.
1475 @node sum invocation
1476 @section @code{sum}: Print checksum and block counts
1479 @cindex 16-bit checksum
1480 @cindex checksum, 16-bit
1482 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1483 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1486 sum [@var{option}]@dots{} [@var{file}]@dots{}
1489 @code{sum} prints the checksum for each @var{file} followed by the
1490 number of blocks in the file (rounded up). If more than one @var{file}
1491 is given, file names are also printed (by default). (With the
1492 @samp{--sysv} option, the corresponding file names are printed when there is
1493 at least one file argument.)
1495 By default, GNU @code{sum} computes checksums using an algorithm
1496 compatible with BSD @code{sum} and prints file sizes in units of
1499 The program accepts the following options. Also see @ref{Common options}.
1505 @cindex BSD @code{sum}
1506 Use the default (BSD compatible) algorithm. This option is included for
1507 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1508 given, it has no effect.
1514 @cindex System V @code{sum}
1515 Compute checksums using an algorithm compatible with System V
1516 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1520 @code{sum} is provided for compatibility; the @code{cksum} program (see
1521 next section) is preferable in new applications.
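
For a tiny input the System V result can be checked by hand, as in this sketch. It assumes the @samp{-s} checksum of a short input is simply the sum of its byte values, which holds while that sum stays below 65536.

```shell
# 'abc' is 3 bytes: 97 + 98 + 99 = 294, and 3 bytes round up to one
# 512-byte block.
set -- $(printf 'abc' | sum -s)
sysv_checksum=$1
sysv_blocks=$2
echo "checksum=$sysv_checksum blocks=$sysv_blocks"
```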
1524 @node cksum invocation
1525 @section @code{cksum}: Print CRC checksum and byte counts
1528 @cindex cyclic redundancy check
1529 @cindex CRC checksum
1531 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1532 given @var{file}, or standard input if none are given or for a
1533 @var{file} of @samp{-}. Synopsis:
1536 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1539 @code{cksum} prints the CRC checksum for each file along with the number
1540 of bytes in the file, and the filename unless no arguments were given.
1542 @code{cksum} is typically used to ensure that files
1543 transferred by unreliable means (e.g., netnews) have not been corrupted,
1544 by comparing the @code{cksum} output for the received files with the
1545 @code{cksum} output for the original files (typically given in the
1548 The CRC algorithm is specified by the @sc{POSIX.2} standard. It is not
1549 compatible with the BSD or System V @code{sum} algorithms (see the
1550 previous section); it is more robust.
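
The comparison described above might look like the following sketch. The file names are hypothetical, and the corruption is simulated by appending one byte to the copy.

```shell
work=$(mktemp -d)
printf 'payload\n' > "$work/original"
cp "$work/original" "$work/received"

# Reading from standard input keeps the file name out of the output,
# so the two results can be compared as plain strings.
sent=$(cksum < "$work/original")
got=$(cksum < "$work/received")

# Simulate corruption in transit by appending one byte.
printf 'x' >> "$work/received"
damaged=$(cksum < "$work/received")
rm -rf "$work"

if [ "$sent" = "$got" ]; then echo "copy intact"; fi
if [ "$sent" != "$damaged" ]; then echo "corruption detected"; fi
```

Each result holds the CRC followed by the byte count, so a change in either one shows up in the comparison.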
1552 The only options are @samp{--help} and @samp{--version}. @xref{Common
1556 @node md5sum invocation
1557 @section @code{md5sum}: Print or check message-digests
1560 @cindex 128-bit checksum
1561 @cindex checksum, 128-bit
1562 @cindex fingerprint, 128-bit
1563 @cindex message-digest, 128-bit
1565 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1566 @dfn{message-digest}) for each specified @var{file}.
1567 If a @var{file} is specified as @samp{-} or if no files are given,
1568 @code{md5sum} computes the checksum for the standard input.
1569 @code{md5sum} can also determine whether a file and checksum are
1570 consistent. Synopsis:
1573 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1574 md5sum [@var{option}]@dots{} --check [@var{file}]
1575 md5sum [@var{option}]@dots{} --string=@var{string} @dots{}
1578 For each @var{file}, @samp{md5sum} outputs the MD5 checksum, a flag
1579 indicating a binary or text input file, and the filename.
1580 If @var{file} is omitted or specified as @samp{-}, standard input is read.
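
A typical round trip records the digests once and verifies them later. The sketch below uses the @samp{--check} and @samp{--status} options from the option list; the file names @file{data} and @file{sums} are placeholders.

```shell
work=$(mktemp -d)
printf 'hello\n' > "$work/data"

# Record the digest, then verify it (run inside the directory so the
# recorded file name is relative).
(cd "$work" && md5sum data > sums)
digest=$(cut -d ' ' -f 1 "$work/sums")

if (cd "$work" && md5sum --check --status sums); then
  before=pass
else
  before=fail
fi

# Change the file and verify again; the check should now fail.
printf 'tampered\n' >> "$work/data"
if (cd "$work" && md5sum --check --status sums); then
  after=pass
else
  after=fail
fi
rm -rf "$work"

echo "before=$before after=$after"
```

With @samp{--status} nothing is printed by @code{md5sum} itself; only the exit status tells the two runs apart.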
1582 The program accepts the following options. Also see @ref{Common options}.
1590 @cindex binary input files
1591 Treat all input files as binary. This option has no effect on Unix
1592 systems, since they don't distinguish between binary and text files.
1593 This option is useful on systems that have different internal and
1594 external character representations.
1598 Read filenames and checksum information from the single @var{file}
1599 (or from stdin if no @var{file} was specified) and report whether
1600 each named file and the corresponding checksum data are consistent.
1601 The input to this mode of @code{md5sum} is usually the output of
1602 a prior, checksum-generating run of @samp{md5sum}.
1603 Each valid line of input consists of an MD5 checksum, a binary/text
1604 flag, and then a filename.
1605 Binary files are marked with @samp{*}, text with @samp{ }.
1606 For each such line, @code{md5sum} reads the named file and computes its
1607 MD5 checksum. Then, if the computed message digest does not match the
1608 one on the line with the filename, the file is noted as having
1609 failed the test. Otherwise, the file passes the test.
1610 By default, for each valid line, one line is written to standard
1611 output indicating whether the named file passed the test.
1612 After all checks have been performed, if there were any failures,
1613 a warning is issued to standard error.
1614 Use the @samp{--status} option to inhibit that output.
1615 If any listed file cannot be opened or read, if any valid line has
1616 an MD5 checksum inconsistent with the associated file, or if no valid
1617 line is found, @code{md5sum} exits with nonzero status. Otherwise,
1618 it exits successfully.
1622 @cindex verifying MD5 checksums
1623 This option is useful only when verifying checksums.
1624 When verifying checksums, don't generate the default one-line-per-file
1625 diagnostic and don't output the warning summarizing any failures.
1626 Failures to open or read a file still evoke individual diagnostics to
1628 If all listed files are readable and are consistent with the associated
1629 MD5 checksums, exit successfully. Otherwise exit with a status code
1630 indicating there was a failure.
1632 @itemx --string=@var{string}
1634 Compute the message digest for @var{string}, instead of for a file. The
1635 result is the same as for a file that contains exactly @var{string}.
1641 @cindex text input files
1642 Treat all input files as text files. This is the reverse of
1649 @cindex verifying MD5 checksums
1650 When verifying checksums, warn about improperly formatted MD5 checksum lines.
1651 This option is useful only if all but a few lines in the checked input
1657 @node Operating on sorted files
1658 @chapter Operating on sorted files
1660 @cindex operating on sorted files
1661 @cindex sorted files, operations on
1663 These commands work with (or produce) sorted files.
1666 * sort invocation:: Sort text files.
1667 * uniq invocation:: Uniqify files.
1668 * comm invocation:: Compare two sorted files line by line.
1672 @node sort invocation
1673 @section @code{sort}: Sort text files
1676 @cindex sorting files
1678 @code{sort} sorts, merges, or compares all the lines from the given
1679 files, or standard input if none are given or for a @var{file} of
1680 @samp{-}. By default, @code{sort} writes the results to standard
1684 sort [@var{option}]@dots{} [@var{file}]@dots{}
1687 @code{sort} has three modes of operation: sort (the default), merge,
1688 and check for sortedness. The following options change the operation
1695 @cindex checking for sortedness
1696 Check whether the given files are already sorted: if they are not all
1697 sorted, print an error message and exit with a status of 1.
1698 Otherwise, exit successfully.
1702 @cindex merging sorted files
1703 Merge the given files by sorting them as a group. Each input file must
1704 already be individually sorted. It always works to sort instead of
1705 merge; merging is provided because it is faster, in the case where it
1710 A pair of lines is compared as follows: if any key fields have been
1711 specified, @code{sort} compares each pair of fields, in the order
1712 specified on the command line, according to the associated ordering
1713 options, until a difference is found or no fields are left.
1715 If any of the global options @samp{Mbdfinr} are given but no key fields
1716 are specified, @code{sort} compares the entire lines according to the
1719 Finally, as a last resort when all keys compare equal (or if no
1720 ordering options were specified at all), @code{sort} compares the lines
1721 byte by byte in machine collating sequence. The last resort comparison
1722 honors the @samp{-r} global option. The @samp{-s} (stable) option
1723 disables this last-resort comparison so that lines in which all fields
1724 compare equal are left in their original relative order. If no fields
1725 or global options are specified, @samp{-s} has no effect.
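
The effect of the last-resort comparison can be seen in a sketch with a made-up four-line input whose first fields tie in pairs. @samp{LC_ALL=C} is set only to make the byte-by-byte ordering predictable.

```shell
input=$(printf '%s\n' 'b 2' 'a 2' 'b 1' 'a 1')

# Ties on field 1 are broken by comparing the whole lines.
default_order=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1,1)

# With -s, lines with equal keys keep their input order.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf '%s\n' "$default_order" '--' "$stable_order"
```

Without @samp{-s} the output is fully ordered (@samp{a 1}, @samp{a 2}, @samp{b 1}, @samp{b 2}); with it, @samp{a 2} stays ahead of @samp{a 1} as in the input.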
1727 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1728 input line length or restrictions on bytes allowed within lines. In
1729 addition, if the final byte of an input file is not a newline, GNU
1730 @code{sort} silently supplies one.
1732 Upon any error, @code{sort} exits with a status of @samp{2}.
1735 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1736 value as the directory for temporary files instead of @file{/tmp}. The
1737 @samp{-T @var{tempdir}} option in turn overrides the environment
1740 The following options affect the ordering of output lines. They may be
1741 specified globally or as part of a specific key field. If no key
1742 fields are specified, global options apply to comparison of entire
1743 lines; otherwise the global options are inherited by key fields that do
1744 not specify any special options of their own.
1750 @cindex blanks, ignoring leading
1751 Ignore leading blanks when finding sort keys in each line.
1755 @cindex phone directory order
1756 @cindex telephone directory order
1757 Sort in @dfn{phone directory} order: ignore all characters except
1758 letters, digits and blanks when sorting.
1762 @cindex case folding
1763 Fold lowercase characters into the equivalent uppercase characters when
1764 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1768 @cindex general numeric sort
1769 Sort numerically, but use @code{strtod(3)} to arrive at the numeric values.
1770 This allows floating point numbers to be specified in scientific notation,
1771 like @code{1.0e-34} and @code{10e100}. Use this option only if there
1772 is no alternative; it is much slower than @samp{-n} and numbers with
1773 too many significant digits will be compared as if they had been
1774 truncated. In addition, numbers outside the range of representable
1775 double precision floating point numbers are treated as if they were
1776 zeroes; overflow and underflow are not reported.
1780 @cindex unprintable characters, ignoring
1781 Ignore characters outside the printable ASCII range 040-0176 octal
1782 (inclusive) when sorting.
1786 @cindex months, sorting by
1787 An initial string, consisting of any amount of whitespace, followed
1788 by three letters abbreviating a month name, is folded to UPPER case and
1789 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1790 Invalid names compare low to valid names.
1794 @cindex numeric sort
1795 Sort numerically: the number begins each line; specifically, it consists
1796 of optional whitespace, an optional @samp{-} sign, and zero or more
1797 digits, optionally followed by a decimal point and zero or more digits.
1799 @code{sort -n} uses what might be considered an unconventional method
1800 to compare strings representing floating point numbers. Rather than
1801 first converting each string to the C @code{double} type and then
1802 comparing those values, @code{sort} aligns the decimal points in the two
1803 strings and compares the strings a character at a time. One benefit
1804 of using this approach is its speed. In practice this is much more
1805 efficient than performing the two corresponding string-to-double (or even
1806 string-to-integer) conversions and then comparing doubles. In addition,
1807 there is no corresponding loss of precision. Converting each string to
1808 @code{double} before comparison would limit precision to about 16 digits
1811 Neither a leading @samp{+} nor exponential notation is recognized.
1812 To compare such strings numerically, use the @samp{-g} option.
1816 @cindex reverse sorting
1817 Reverse the result of comparison, so that lines with greater key values
1818 appear earlier in the output instead of later.
1826 @item -o @var{output-file}
1828 @cindex overwriting of input, allowed
1829 Write output to @var{output-file} instead of standard output.
1830 If @var{output-file} is one of the input files, @code{sort} copies
1831 it to a temporary file before sorting and writing the output to
1834 @item -t @var{separator}
1836 @cindex field separator character
1837 Use character @var{separator} as the field separator when finding the
1838 sort keys in each line. By default, fields are separated by the empty
1839 string between a non-whitespace character and a whitespace character.
1840 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
1841 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
1842 not considered to be part of either the field preceding or the field
1847 @cindex uniqifying output
1848 For the default case or the @samp{-m} option, only output the first
1849 of a sequence of lines that compare equal. For the @samp{-c} option,
1850 check that no pair of consecutive lines compares equal.
1852 @item -k @var{pos1}[,@var{pos2}]
1855 The recommended, @sc{POSIX}, option for specifying a sort field. The field
1856 consists of the line between @var{pos1} and @var{pos2} (or the end of
1857 the line, if @var{pos2} is omitted), inclusive. Fields and character
1858 positions are numbered starting with 1. See below.
1862 @cindex sort zero-terminated lines
1863 Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII}
1864 @sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
1865 This option can be useful in conjunction with @samp{perl -0} or
1866 @samp{find -print0} and @samp{xargs -0} which do the same in order to
1867 reliably handle arbitrary pathnames (even those which contain Line Feed
1870 @item +@var{pos1}[-@var{pos2}]
1871 The obsolete, traditional option for specifying a sort field. The field
1872 consists of the line from @var{pos1} up to but @emph{not including}
1873 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
1874 and character positions are numbered starting with 0. See below.
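
The difference between the @samp{-n} and @samp{-g} ordering options described earlier shows up as soon as exponents appear. This sketch uses three made-up values:

```shell
input=$(printf '%s\n' '1e3' '20' '-5')

# -n reads "1e3" as 1, because exponential notation is not recognized.
numeric=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# -g converts the whole string, so "1e3" is 1000.
general=$(printf '%s\n' "$input" | LC_ALL=C sort -g)

printf '%s\n' "$numeric" '--' "$general"
```

Under @samp{-n} the value @samp{1e3} sorts before @samp{20}; under @samp{-g} it sorts after.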
1878 In addition, when GNU @code{sort} is invoked with exactly one argument,
1879 options @samp{--help} and @samp{--version} are recognized. @xref{Common
1882 Historical (BSD and System V) implementations of @code{sort} have
1883 differed in their interpretation of some options, particularly
1884 @samp{-b}, @samp{-f}, and @samp{-n}. GNU @code{sort} follows the @sc{POSIX}
1885 behavior, which is usually (but not always!) like the System V behavior.
1886 According to @sc{POSIX}, @samp{-n} no longer implies @samp{-b}. For
1887 consistency, @samp{-M} has been changed in the same way. This may
1888 affect the meaning of character positions in field specifications in
1889 obscure cases. The only fix is to add an explicit @samp{-b}.
1891 A position in a sort field specified with the @samp{-k} or @samp{+}
1892 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
1893 of the field to use and @var{c} is the number of the first character
1894 from the beginning of the field (for @samp{+@var{pos}}) or from the end
1895 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
1896 is omitted, it is taken to be the first character in the field. If the
1897 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
1898 specification is counted from the first nonblank character of the field
1899 (for @samp{+@var{pos}}) or from the first nonblank character following
1900 the previous field (for @samp{-@var{pos}}).
1902 A sort key option may also have any of the option letters @samp{Mbdfinr}
1903 appended to it, in which case the global ordering options are not used
1904 for that particular field. The @samp{-b} option may be independently
1905 attached to either or both of the @samp{+@var{pos}} and
1906 @samp{-@var{pos}} parts of a field specification, and if it is inherited
1907 from the global options it will be attached to both. If a @samp{-n} or
1908 @samp{-M} option is used, thus implying a @samp{-b} option, the
1909 @samp{-b} option is taken to apply to both the @samp{+@var{pos}} and the
1910 @samp{-@var{pos}} parts of a key specification. Keys may span multiple
1913 Here are some examples to illustrate various combinations of options.
1914 In them, the @sc{POSIX} @samp{-k} option is used to specify sort keys rather
1915 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
1920 Sort in descending (reverse) numeric order.
1926 Sort alphabetically, omitting the first and second fields.
1927 This uses a single key composed of the characters beginning
1928 at the start of field three and extending to the end of each line.
1935 Sort numerically on the second field and resolve ties by sorting
1936 alphabetically on the third and fourth characters of field five.
1937 Use @samp{:} as the field delimiter.
1940 sort -t : -k 2,2n -k 5.3,5.4
1943 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
1944 @code{sort} would have used all characters beginning in the second field
1945 and extending to the end of the line as the primary @emph{numeric}
1946 key. For the large majority of applications, treating keys spanning
1947 more than one field as numeric will not do what you expect.
1949 Also note that the @samp{n} modifier was applied to the field-end
1950 specifier for the first key. It would have been equivalent to
1951 specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
1952 @samp{b} apply to the associated @emph{field}, regardless of whether
1953 the modifier character is attached to the field-start and/or the
1954 field-end part of the key specifier.
1957 Sort the password file on the fifth field and ignore any
1958 leading white space. Sort lines with equal values in field five
1959 on the numeric user ID in field three.
1962 sort -t : -k 5b,5 -k 3,3n /etc/passwd
1965 An alternative is to use the global numeric modifier @samp{-n}.
1968 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
1972 Generate a tags file in case insensitive sorted order.
1974 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
1977 The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
1978 that pathnames that contain Line Feed characters will not get broken up
1979 by the sort operation.
1981 Finally, to ignore both leading and trailing white space, you
1982 could have applied the @samp{b} modifier to the field-end specifier
1986 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
1989 or by using the global @samp{-b} modifier instead of @samp{-n}
1990 and an explicit @samp{n} with the second key specifier.
1993 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
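
To see a two-key specification at work, the sketch below runs the @samp{-k 2,2n -k 5.3,5.4} command from the examples above on three made-up records. The data is invented so that two records tie on the numeric key.

```shell
input=$(printf '%s\n' 'a:2:x:x:zzAB' 'b:1:x:x:zzZZ' 'c:2:x:x:zzAA')

# Primary key: field 2, numeric. Tie-breaker: characters 3-4 of field 5.
ordered=$(printf '%s\n' "$input" | LC_ALL=C sort -t : -k 2,2n -k 5.3,5.4)

printf '%s\n' "$ordered"
```

Record @samp{b} comes first because its second field is 1. Records @samp{a} and @samp{c} tie on 2, so @samp{AA} puts @samp{c} ahead of @samp{a} with @samp{AB}.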
1999 @node uniq invocation
2000 @section @code{uniq}: Uniqify files
2003 @cindex uniqify files
2005 @code{uniq} writes the unique lines in the given @var{input}, or
2006 standard input if nothing is given or for an @var{input} name of
2010 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2013 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2014 discards all but one of identical successive lines. Optionally, it can
2015 instead show only lines that appear exactly once, or lines that appear
2018 The input must be sorted. If your input is not sorted, perhaps you want
2019 to use @code{sort -u}.
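
The three modes can be compared on one small input. In this sketch the input is made up and is sorted first, since @code{uniq} only compares adjacent lines.

```shell
input=$(printf '%s\n' b a b c a b)
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort)

distinct=$(printf '%s\n' "$sorted" | uniq)       # one copy of every line
repeated=$(printf '%s\n' "$sorted" | uniq -d)    # lines that occur more than once
singles=$(printf '%s\n' "$sorted" | uniq -u)     # lines that occur exactly once

# sort -u gives the same result as sort piped into plain uniq.
via_sort=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

printf '%s\n' "$distinct"
```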
2021 If no @var{output} file is specified, @code{uniq} writes to standard
2024 The program accepts the following options. Also see @ref{Common options}.
2030 @itemx --skip-fields=@var{n}
2033 @opindex --skip-fields
2034 Skip @var{n} fields on each line before checking for uniqueness. Fields
2035 are sequences of non-space non-tab characters that are separated from
2036 each other by at least one space or tab.
2040 @itemx --skip-chars=@var{n}
2043 @opindex --skip-chars
2044 Skip @var{n} characters before checking for uniqueness. If you use both
2045 the field and character skipping options, fields are skipped over first.
2051 Print the number of times each line occurred along with the line.
2054 @itemx --ignore-case
2056 @opindex --ignore-case
2057 Ignore differences in case when comparing lines.
2063 @cindex duplicate lines, outputting
2064 Print only duplicate lines.
2070 @cindex unique lines, outputting
2071 Print only unique lines.
2074 @itemx --check-chars=@var{n}
2076 @opindex --check-chars
2077 Compare @var{n} characters on each line (after skipping any specified
2078 fields and characters). By default the entire rest of the lines are
2084 @node comm invocation
2085 @section @code{comm}: Compare two sorted files line by line
2088 @cindex line-by-line comparison
2089 @cindex comparing sorted files
2091 @code{comm} writes to standard output lines that are common, and lines
2092 that are unique, to two input files; a file name of @samp{-} means
2093 standard input. Synopsis:
2096 comm [@var{option}]@dots{} @var{file1} @var{file2}
2099 The input files must be sorted before @code{comm} can be used.
2101 @cindex differing lines
2102 @cindex common lines
2103 With no options, @code{comm} produces three-column output. Column one
2104 contains lines unique to @var{file1}, column two contains lines unique
2105 to @var{file2}, and column three contains lines common to both files.
2106 Columns are separated by @key{TAB}.
2107 @c FIXME: when there's an option to supply an alternative separator
2108 @c string, append `by default' to the above sentence.
2113 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2114 the corresponding columns. Also see @ref{Common options}.
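
Suppressing two of the three columns leaves a plain list, which is the most common way to use @code{comm}. The sketch below assumes two small, already sorted files with made-up contents.

```shell
work=$(mktemp -d)
printf '%s\n' apple banana cherry > "$work/left"
printf '%s\n' banana cherry date > "$work/right"

both=$(LC_ALL=C comm -12 "$work/left" "$work/right")        # column 3 only
only_left=$(LC_ALL=C comm -23 "$work/left" "$work/right")   # column 1 only
only_right=$(LC_ALL=C comm -13 "$work/left" "$work/right")  # column 2 only
rm -rf "$work"

printf '%s\n' "$both"
```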
2117 @node Operating on fields within a line
2118 @chapter Operating on fields within a line
2121 * cut invocation:: Print selected parts of lines.
2122 * paste invocation:: Merge lines of files.
2123 * join invocation:: Join lines on a common field.
2127 @node cut invocation
2128 @section @code{cut}: Print selected parts of lines
2131 @code{cut} writes to standard output selected parts of each line of each
2132 input file, or standard input if no files are given or for a file name of
2136 cut [@var{option}]@dots{} [@var{file}]@dots{}
2139 In the table which follows, the @var{byte-list}, @var{character-list},
2140 and @var{field-list} are one or more numbers or ranges (two numbers
2141 separated by a dash) separated by commas. Bytes, characters, and
2142 fields are numbered starting at 1. Incomplete ranges may be
2143 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
2144 @samp{@var{n}} through end of line or last field.
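
The list syntax can be tried on a single record. This sketch uses a hypothetical passwd-style line with @samp{:} as the delimiter, selected with the @samp{-d} and @samp{-f} options.

```shell
record='root:x:0:0:Super User:/root:/bin/sh'   # hypothetical passwd-style line

middle=$(printf '%s\n' "$record" | cut -d : -f 3-4)   # fields 3 through 4
upto=$(printf '%s\n' "$record" | cut -d : -f -2)      # same as 1-2
rest=$(printf '%s\n' "$record" | cut -d : -f 6-)      # field 6 to the last
picked=$(printf '%s\n' "$record" | cut -c 1-4,6)      # characters 1-4 and 6

printf '%s\n' "$middle" "$upto" "$rest" "$picked"
```

Selected fields are printed with the delimiter between them, so the first three results are @samp{0:0}, @samp{root:x}, and @samp{/root:/bin/sh}.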
2146 The program accepts the following options. Also see @ref{Common
2151 @item -b @var{byte-list}
2152 @itemx --bytes=@var{byte-list}
2155 Print only the bytes in positions listed in @var{byte-list}. Tabs and
2156 backspaces are treated like any other character; they take up 1 byte.
2158 @item -c @var{character-list}
2159 @itemx --characters=@var{character-list}
2161 @opindex --characters
2162 Print only characters in positions listed in @var{character-list}.
2163 The same as @samp{-b} for now, but internationalization will change
2164 that. Tabs and backspaces are treated like any other character; they
2165 take up 1 character.
2167 @item -f @var{field-list}
2168 @itemx --fields=@var{field-list}
2171 Print only the fields listed in @var{field-list}. Fields are
2172 separated by a @key{TAB} by default.
2174 @item -d @var{delim}
2175 @itemx --delimiter=@var{delim}
2177 @opindex --delimiter
2178 For @samp{-f}, fields are separated by the first character in @var{delim}
2179 (default is @key{TAB}).
2183 Do not split multi-byte characters (no-op for now).
2186 @itemx --only-delimited
2188 @opindex --only-delimited
2189 For @samp{-f}, do not print lines that do not contain the field separator
2195 @node paste invocation
2196 @section @code{paste}: Merge lines of files
2199 @cindex merging files
2201 @code{paste} writes to standard output lines consisting of sequentially
2202 corresponding lines of each given file, separated by @key{TAB}.
2203 Standard input is used for a file name of @samp{-} or if no input files
2209 paste [@var{option}]@dots{} [@var{file}]@dots{}
2212 The program accepts the following options. Also see @ref{Common options}.
2220 Paste the lines of one file at a time rather than one line from each file.
2223 @item -d @var{delim-list}
2224 @itemx --delimiters @var{delim-list}
2226 @opindex --delimiters
2227 Consecutively use the characters in @var{delim-list} instead of
2228 @key{TAB} to separate merged lines. When @var{delim-list} is
2229 exhausted, start again at its beginning.
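The following sketch shows the default @key{TAB} separator, a two-character delimiter list, and the one-file-at-a-time mode. The file names and contents are made up, and @code{mktemp -d} is assumed to be available.

```shell
# Scratch files; names and contents are hypothetical.
dir=$(mktemp -d)
printf '1\n2\n3\n4\n' > "$dir/a"
printf 'w\nx\ny\nz\n' > "$dir/b"
printf 'p\nq\nr\ns\n' > "$dir/c"

# Default: corresponding lines separated by TAB.
tabbed=$(paste "$dir/a" "$dir/b")

# Two delimiters, used in turn between the three files.
delims=$(paste -d ',:' "$dir/a" "$dir/b" "$dir/c")

# -s pastes one file at a time; the delimiter list restarts when exhausted.
serial=$(paste -s -d ',:' "$dir/a")

printf '%s\n' "$delims" "$serial"
rm -r "$dir"
```

With these inputs, the serial form yields a single line in which the comma and colon alternate.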
2234 @node join invocation
2235 @section @code{join}: Join lines on a common field
2238 @cindex common field, joining on
2240 @code{join} writes to standard output a line for each pair of input
2241 lines that have identical join fields. Synopsis:
2244 join [@var{option}]@dots{} @var{file1} @var{file2}
2247 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2248 meaning standard input. @var{file1} and @var{file2} should be already
2249 sorted in increasing order (not numerically) on the join fields; unless
2250 the @samp{-t} option is given, they should be sorted ignoring blanks at
2251 the start of the join field, as in @code{sort -b}. If the
2252 @samp{--ignore-case} option is given, lines should be sorted without
2253 regard to the case of characters in the join field, as in @code{sort -f}.
2255 The defaults are: the join field is the first field in each line;
2256 fields in the input are separated by one or more blanks, with leading
2257 blanks on the line ignored; fields in the output are separated by a
2258 space; each output line consists of the join field, the remaining
2259 fields from @var{file1}, then the remaining fields from @var{file2}.
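As a small illustration of these defaults, the sketch below joins two files on their first field. The file names and records are invented, and @code{LC_ALL=C} is set only so that the assumed sort order is unambiguous.

```shell
dir=$(mktemp -d)
# Both hypothetical files are already sorted on the first field.
printf 'a 1\nb 2\nc 3\n' > "$dir/left"
printf 'a x\nc y\nd z\n' > "$dir/right"

# Only keys present in both files (a and c) produce output lines:
# the join field, then the rest of file 1, then the rest of file 2.
joined=$(LC_ALL=C join "$dir/left" "$dir/right")

printf '%s\n' "$joined"
rm -r "$dir"
```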
2261 The program accepts the following options. Also see @ref{Common options}.
2265 @item -a @var{file-number}
2267 Print a line for each unpairable line in file @var{file-number} (either
2268 @samp{1} or @samp{2}), in addition to the normal output.
2270 @item -e @var{string}
2272 Replace those output fields that are missing in the input with @var{string}.
2276 @itemx --ignore-case
2278 @opindex --ignore-case
2279 Ignore differences in case when comparing keys.
2280 With this option, the lines of the input files must be ordered in the same way.
2281 Use @samp{sort -f} to produce this ordering.
2283 @item -1 @var{field}
2284 @itemx -j1 @var{field}
2287 Join on field @var{field} (a positive integer) of file 1.
2289 @item -2 @var{field}
2290 @itemx -j2 @var{field}
2293 Join on field @var{field} (a positive integer) of file 2.
2295 @item -j @var{field}
2296 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2298 @item -o @var{field-list}@dots{}
2299 Construct each output line according to the format in @var{field-list}.
2300 Each element in @var{field-list} is either the single character @samp{0} or
2301 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
2302 @samp{2} and @var{n} is a positive field number.
2304 A field specification of @samp{0} denotes the join field.
2305 In most cases, the functionality of the @samp{0} field spec
2306 may be reproduced using the explicit @var{m.n} that corresponds
2307 to the join field. However, when printing unpairable lines
2308 (using either of the @samp{-a} or @samp{-v} options), there is no way
2309 to specify the join field using @var{m.n} in @var{field-list}
2310 if there are unpairable lines in both files.
2311 To give @code{join} that functionality, @sc{POSIX} invented the @samp{0}
2312 field specification notation.
2314 The elements in @var{field-list}
2315 are separated by commas or blanks. Multiple @var{field-list}
2316 arguments can be given after a single @samp{-o} option; the values
2317 of all lists given with @samp{-o} are concatenated together.
2318 All output lines, including those printed because of any @samp{-a} or
2319 @samp{-v} option, are subject to the specified @var{field-list}.
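A sketch of why the @samp{0} field specification matters when unpairable lines from both files are printed. The data and the placeholder string @samp{NONE} are assumptions chosen for the example.

```shell
dir=$(mktemp -d)
# Hypothetical inputs: key b appears only in file 1, key c only in file 2.
printf 'a 1\nb 2\n' > "$dir/left"
printf 'a x\nc y\n' > "$dir/right"

# 0 prints the join field whichever file the line came from;
# -e supplies the text for fields missing from an unpairable line.
full=$(LC_ALL=C join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$dir/left" "$dir/right")

printf '%s\n' "$full"
rm -r "$dir"
```

Writing @samp{1.1} in place of @samp{0} would leave the key of the line that exists only in file 2 to be filled in by @samp{-e}.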
2322 Use character @var{char} as the input and output field separator.
2324 @item -v @var{file-number}
2325 Print a line for each unpairable line in file @var{file-number}
2326 (either @samp{1} or @samp{2}), instead of the normal output.
2330 In addition, when GNU @code{join} is invoked with exactly one argument,
2331 options @samp{--help} and @samp{--version} are recognized. @xref{Common options}.
2335 @node Operating on characters
2336 @chapter Operating on characters
2338 @cindex operating on characters
2340 These commands operate on individual characters.
2343 * tr invocation:: Translate, squeeze, and/or delete characters.
2344 * expand invocation:: Convert tabs to spaces.
2345 * unexpand invocation:: Convert spaces to tabs.
2350 @section @code{tr}: Translate, squeeze, and/or delete characters
2357 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
2360 @code{tr} copies standard input to standard output, performing
2361 one of the following operations:
2365 translate, and optionally squeeze repeated characters in the result,
2367 squeeze repeated characters,
2371 delete characters, then squeeze repeated characters from the result.
2374 The @var{set1} and (if given) @var{set2} arguments define ordered
2375 sets of characters, referred to below as @var{set1} and @var{set2}. These
2376 sets are the characters of the input that @code{tr} operates on.
2377 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
2378 complement (all of the characters that are not in @var{set1}).
2381 * Character sets:: Specifying sets of characters.
2382 * Translating:: Changing one character to another.
2383 * Squeezing:: Squeezing repeats and deleting.
2384 * Warnings in tr:: Warning messages.
2388 @node Character sets
2389 @subsection Specifying sets of characters
2391 @cindex specifying sets of characters
2393 The format of the @var{set1} and @var{set2} arguments resembles
2394 the format of regular expressions; however, they are not regular
2395 expressions, only lists of characters. Most characters simply
2396 represent themselves in these strings, but the strings can contain
2397 the shorthands listed below, for convenience. Some of them can be
2398 used only in @var{set1} or @var{set2}, as noted below.
2402 @item Backslash escapes.
2403 @cindex backslash escapes
2405 A backslash followed by a character not listed below causes an error
2424 The character with the value given by @var{ooo}, which is 1 to 3
2433 The notation @samp{@var{m}-@var{n}} expands to all of the characters
2434 from @var{m} through @var{n}, in ascending order. @var{m} should
2435 collate before @var{n}; if it doesn't, an error results. As an example,
2436 @samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
2437 does not support the System V syntax that uses square brackets to
2438 enclose ranges, translations specified in that format will still work as
2439 long as the brackets in @var{set1} correspond to identical brackets in @var{set2}.
2442 @item Repeated characters.
2443 @cindex repeated characters
2445 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
2446 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
2447 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
2448 to as many copies of @var{c} as are needed to make @var{set2} as long as
2449 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
2450 octal, otherwise in decimal.
2452 @item Character classes.
2453 @cindex character classes
2455 The notation @samp{[:@var{class}:]} expands to all of the characters in
2456 the (predefined) class @var{class}. The characters expand in no
2457 particular order, except for the @code{upper} and @code{lower} classes,
2458 which expand in ascending order. When the @samp{--delete} (@samp{-d})
2459 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
2460 character class can be used in @var{set2}. Otherwise, only the
2461 character classes @code{lower} and @code{upper} are accepted in
2462 @var{set2}, and then only if the corresponding character class
2463 (@code{upper} and @code{lower}, respectively) is specified in the same
2464 relative position in @var{set1}. Doing this specifies case conversion.
2465 The class names are given below; an error results when an invalid class
2477 Horizontal whitespace.
2486 Printable characters, not including space.
2492 Printable characters, including space.
2495 Punctuation characters.
2498 Horizontal or vertical whitespace.
2507 @item Equivalence classes.
2508 @cindex equivalence classes
2510 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
2511 equivalent to @var{c}, in no particular order. Equivalence classes are
2512 a relatively recent invention intended to support non-English alphabets.
2513 But there seems to be no standard way to define them or determine their
2514 contents. Therefore, they are not fully implemented in GNU @code{tr};
2515 each character's equivalence class consists only of that character,
2516 which is of no particular use.
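The shorthands described in this subsection can be tried out with a few one-line commands. This is only a sketch: the input strings are invented, and the results assume the GNU @code{tr} behavior documented here.

```shell
# Range in set1; [#*] in set2 repeats # until set2 is as long as set1.
masked=$(printf 'tel 555-0199\n' | tr '0-9' '[#*]')

# Octal escape: \011 is the TAB character, translated here to a space.
spaced=$(printf 'a\tb\n' | tr '\011' ' ')

# A character class used as set1 for deletion.
letters=$(printf 'r2d2 c3po\n' | tr -d '[:digit:]')

# lower and upper in the same relative position specify case conversion.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$masked" "$spaced" "$letters" "$upper"
```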
2522 @subsection Translating
2524 @cindex translating characters
2526 @code{tr} performs translation when @var{set1} and @var{set2} are
2527 both given and the @samp{--delete} (@samp{-d}) option is not given.
2528 @code{tr} translates each character of its input that is in @var{set1}
2529 to the corresponding character in @var{set2}. Characters not in
2530 @var{set1} are passed through unchanged. When a character appears more
2531 than once in @var{set1} and the corresponding characters in @var{set2}
2532 are not all the same, only the final one is used. For example, these
2533 two commands are equivalent:
2540 A common use of @code{tr} is to convert lowercase characters to
2541 uppercase. This can be done in many ways. Here are three of them:
2544 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
2546 tr '[:lower:]' '[:upper:]'
2549 When @code{tr} is performing translation, @var{set1} and @var{set2}
2550 typically have the same length. If @var{set1} is shorter than
2551 @var{set2}, the extra characters at the end of @var{set2} are ignored.
2553 On the other hand, making @var{set1} longer than @var{set2} is not
2554 portable; @sc{POSIX.2} says that the result is undefined. In this situation,
2555 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
2556 the last character of @var{set2} as many times as necessary. System V
2557 @code{tr} truncates @var{set1} to the length of @var{set2}.
2559 By default, GNU @code{tr} handles this case like BSD @code{tr}. When
2560 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
2561 handles this case like the System V @code{tr} instead. This option is
2562 ignored for operations other than translation.
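The difference between the two behaviors can be seen with a @var{set1} of four characters and a @var{set2} of two. This sketch assumes GNU @code{tr}; other implementations may not accept @samp{-t} or may handle the length mismatch differently.

```shell
# Default (BSD-like): set2 is padded with its last character, as if it were xyyy.
padded=$(printf 'abcd\n' | tr abcd xy)

# With -t (System V-like): set1 is truncated to ab, so c and d pass through.
truncated=$(printf 'abcd\n' | tr -t abcd xy)

printf '%s %s\n' "$padded" "$truncated"
```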
2564 Acting like System V @code{tr} in this case breaks the relatively common
2568 tr -cs A-Za-z0-9 '\012'
2572 because it converts only zero bytes (the first element in the
2573 complement of @var{set1}), rather than all non-alphanumerics, to newlines.
2578 @subsection Squeezing repeats and deleting
2580 @cindex squeezing repeat characters
2581 @cindex deleting characters
2583 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
2584 removes any input characters that are in @var{set1}.
2586 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
2587 @code{tr} replaces each input sequence of a repeated character that
2588 is in @var{set1} with a single occurrence of that character.
2590 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
2591 first performs any deletions using @var{set1}, then squeezes repeats
2592 from any remaining characters using @var{set2}.
2594 The @samp{--squeeze-repeats} option may also be used when translating,
2595 in which case @code{tr} first performs translation, then squeezes
2596 repeats from any remaining characters using @var{set2}.
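The four combinations just described can be compared side by side. The inputs are invented for the sketch, and the results assume GNU @code{tr}.

```shell
# -d alone: delete every character in set1.
deleted=$(printf 'banana\n' | tr -d 'a')

# -s alone: squeeze runs of characters in set1 (c and d are untouched).
squeezed=$(printf 'aabbccdd\n' | tr -s 'ab')

# -d and -s together: delete using set1, then squeeze using set2.
both=$(printf 'aabbccdd\n' | tr -ds 'a' 'c')

# -s while translating: blanks become newlines, then newlines are squeezed.
xlated=$(printf 'a  b   c\n' | tr -s ' ' '\012')

printf '%s\n' "$deleted" "$squeezed" "$both" "$xlated"
```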
2598 Here are some examples to illustrate various combinations of options:
2603 Remove all zero bytes:
2610 Put all words on lines by themselves. This converts all
2611 non-alphanumeric characters to newlines, then squeezes each string
2612 of repeated newlines into a single newline:
2615 tr -cs 'a-zA-Z0-9' '[\n*]'
2619 Convert each sequence of repeated newlines to a single newline:
2628 @node Warnings in tr
2629 @subsection Warning messages
2631 @vindex POSIXLY_CORRECT
2632 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
2633 following warning and error messages, for strict compliance with
2634 @sc{POSIX.2}. Otherwise, the following diagnostics are issued:
2639 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
2640 is not, and @var{set2} is given, GNU @code{tr} by default prints
2641 a usage message and exits, because @var{set2} would not be used.
2642 The @sc{POSIX} specification says that @var{set2} must be ignored in
2643 this case. Silently ignoring arguments is a bad idea.
2646 When an ambiguous octal escape is given. For example, @samp{\400}
2647 is actually @samp{\40} followed by the digit @samp{0}, because the
2648 value 400 octal does not fit into a single byte.
2652 GNU @code{tr} does not provide complete BSD or System V compatibility.
2653 For example, it is impossible to disable interpretation of the @sc{POSIX}
2654 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
2655 @code{tr} does not delete zero bytes automatically, unlike traditional
2656 Unix versions, which provide no way to preserve zero bytes.
2659 @node expand invocation
2660 @section @code{expand}: Convert tabs to spaces
2663 @cindex tabs to spaces, converting
2664 @cindex converting tabs to spaces
2666 @code{expand} writes the contents of each given @var{file}, or standard
2667 input if none are given or for a @var{file} of @samp{-}, to standard
2668 output, with tab characters converted to the appropriate number of
2672 expand [@var{option}]@dots{} [@var{file}]@dots{}
2675 By default, @code{expand} converts all tabs to spaces. It preserves
2676 backspace characters in the output; they decrement the column count for
2677 tab calculations. The default action is equivalent to @samp{-8} (set
2678 tabs every 8 columns).
2680 The program accepts the following options. Also see @ref{Common options}.
2684 @item -@var{tab1}[,@var{tab2}]@dots{}
2685 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2686 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2690 @cindex tabstops, setting
2691 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2692 (default is 8). Otherwise, set the tabs at columns @var{tab1},
2693 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
2694 last tabstop given with single spaces. If the tabstops are specified
2695 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2696 blanks as well as by commas.
2702 @cindex initial tabs, converting
2703 Convert only initial tabs (those that come before any character other
2704 than a space or tab) on each line to spaces.
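A short sketch of the tab stop settings. The inputs are invented, and the option spelling @samp{-i} for converting only initial tabs is the GNU one, assumed here.

```shell
# \t in the printf format stands for a TAB character.
default=$(printf 'a\tb\n' | expand)            # tab stops every 8 columns
four=$(printf 'a\tb\n' | expand -t 4)          # tab stops every 4 columns
initial=$(printf '\tx\ty\n' | expand -i -t 4)  # only the leading tab is converted

# Show the results with blanks made visible as dots.
printf '%s\n' "$default" "$four" "$initial" | tr ' ' '.'
```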
2709 @node unexpand invocation
2710 @section @code{unexpand}: Convert spaces to tabs
2714 @code{unexpand} writes the contents of each given @var{file}, or
2715 standard input if none are given or for a @var{file} of @samp{-}, to
2716 standard output, with strings of two or more space or tab characters
2717 converted to as many tabs as possible followed by as many spaces as are
2721 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
2724 By default, @code{unexpand} converts only initial spaces and tabs (those
2725 that come before any character other than a space or tab) on each line. It
2726 preserves backspace characters in the output; they decrement the column
2727 count for tab calculations. By default, tabs are set at every 8th column.
2730 The program accepts the following options. Also see @ref{Common options}.
2734 @item -@var{tab1}[,@var{tab2}]@dots{}
2735 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2736 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2740 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2741 instead of the default 8. Otherwise, set the tabs at columns
2742 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
2743 tabs beyond the tabstops given unchanged. If the tabstops are specified
2744 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2745 blanks as well as by commas. This option implies the @samp{-a} option.
2751 Convert all strings of two or more spaces or tabs, not just initial ones, to tabs.
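The sketch below contrasts the default behavior with @samp{-a} and with an explicit tab stop. The inputs are built with @code{printf} field widths so that the number of spaces is unambiguous, and the results assume GNU @code{unexpand}.

```shell
# %8s with an empty argument produces exactly eight spaces.
lead=$(printf '%8sx\n' '' | unexpand)         # leading spaces become one TAB
inner=$(printf 'a%7sb\n' '' | unexpand)       # interior spaces are left alone
all=$(printf 'a%7sb\n' '' | unexpand -a)      # -a converts interior runs too
four=$(printf '%4sx\n' '' | unexpand -t 4)    # tab stops every 4 columns

# Make the results visible: TAB shown as >, space as a dot.
printf '%s\n' "$lead" "$inner" "$all" "$four" | tr '\011 ' '>.'
```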
2759 @node Opening the software toolbox
2760 @chapter Opening the software toolbox
2762 This chapter originally appeared in @cite{Linux Journal}, volume 1,
2763 number 2, in the @cite{What's GNU?} column. It was written by Arnold
2767 * Toolbox introduction::
2769 * The @code{who} command::
2770 * The @code{cut} command::
2771 * The @code{sort} command::
2772 * The @code{uniq} command::
2773 * Putting the tools together::
2777 @node Toolbox introduction
2778 @unnumberedsec Toolbox introduction
2780 This month's column is only peripherally related to the GNU Project, in
2781 that it describes a number of the GNU tools on your Linux system and how they
2782 might be used. What it's really about is the ``Software Tools'' philosophy
2783 of program development and usage.
2785 The software tools philosophy was an important and integral concept
2786 in the initial design and development of Unix (of which Linux and GNU are
2787 essentially clones). Unfortunately, in the modern day press of
2788 Internetworking and flashy GUIs, it seems to have fallen by the
2789 wayside. This is a shame, since it provides a powerful mental model
2790 for solving many kinds of problems.
2792 Many people carry a Swiss Army knife around in their pants pockets (or
2793 purse). A Swiss Army knife is a handy tool to have: it has several knife
2794 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
2795 a number of other things on it. For the everyday, small miscellaneous jobs
2796 where you need a simple, general purpose tool, it's just the thing.
2798 On the other hand, an experienced carpenter doesn't build a house using
2799 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
2800 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
2801 exactly when and where to use each tool; you won't catch him hammering nails
2802 with the handle of his screwdriver.
2804 The Unix developers at Bell Labs were all professional programmers and trained
2805 computer scientists. They had found that while a one-size-fits-all program
2806 might appeal to a user because there's only one program to use, in practice
2814 difficult to maintain and
2818 difficult to extend to meet new situations.
2821 Instead, they felt that programs should be specialized tools. In short, each
2822 program ``should do one thing well.'' No more and no less. Such programs are
2823 simpler to design, write, and get right---they only do one thing.
2825 Furthermore, they found that with the right machinery for hooking programs
2826 together, the whole was greater than the sum of the parts. By combining
2827 several special purpose programs, you could accomplish a specific task
2828 that none of the programs was designed for, and accomplish it much more
2829 quickly and easily than if you had to write a special purpose program.
2830 We will see some (classic) examples of this further on in the column.
2831 (An important additional point was that, if necessary, you should take
2832 a detour and build any software tools you may need first, if you don't
2833 already have something appropriate in the toolbox.)
2835 @node I/O redirection
2836 @unnumberedsec I/O redirection
2838 Hopefully, you are familiar with the basics of I/O redirection in the
2839 shell, in particular the concepts of ``standard input,'' ``standard output,''
2840 and ``standard error''. Briefly, ``standard input'' is a data source, where
2841 data comes from. A program should not need to either know or care if the
2842 data source is a disk file, a keyboard, a magnetic tape, or even a punched
2843 card reader. Similarly, ``standard output'' is a data sink, where data goes
2844 to. The program should neither know nor care where this might be.
2845 Programs that only read their standard input, do something to the data,
2846 and then send it on, are called ``filters'', by analogy to filters in a
2849 With the Unix shell, it's very easy to set up data pipelines:
2852 program_to_create_data | filter1 | .... | filterN > final.pretty.data
2855 We start out by creating the raw data; each filter applies some successive
2856 transformation to the data, until by the time it comes out of the pipeline,
2857 it is in the desired form.
2859 This is fine and good for standard input and standard output. Where does the
2860 standard error come into play? Well, think about @code{filter1} in
2861 the pipeline above. What happens if it encounters an error in the data it
2862 sees? If it writes an error message to standard output, it will just
2863 disappear down the pipeline into @code{filter2}'s input, and the
2864 user will probably never see it. So programs need a place where they can send
2865 error messages so that the user will notice them. This is standard error,
2866 and it is usually connected to your console or window, even if you have
2867 redirected standard output of your program away from your screen.
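To make this concrete, here is a small sketch in which a stand-in for @code{filter1} passes its data along and complains on standard error. The message text and the use of a temporary file to capture it are assumptions made for the demonstration.

```shell
errlog=$(mktemp)

# The middle stage plays the role of filter1: it copies its input and
# writes a complaint to standard error, which is sent to a scratch file
# here instead of the terminal.
result=$(printf 'some data\n' |
  sh -c 'cat; echo "filter1: bad record" >&2' 2> "$errlog" |
  tr 'a-z' 'A-Z')

complaint=$(cat "$errlog")
printf 'pipeline output: %s\n' "$result"
printf 'standard error:  %s\n' "$complaint"
rm "$errlog"
```

The complaint never reaches the last stage, so it is not converted to upper case along with the data.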
2869 For filter programs to work together, the format of the data has to be
2870 agreed upon. The most straightforward and easiest format to use is simply
2871 lines of text. Unix data files are generally just streams of bytes, with
2872 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
2873 conventionally called a ``newline'' in the Unix literature. (This is
2874 @code{'\n'} if you're a C programmer.) This is the format used by all
2875 the traditional filtering programs. (Many earlier operating systems
2876 had elaborate facilities and special purpose programs for managing
2877 binary data. Unix has always shied away from such things, under the
2878 philosophy that it's easiest to simply be able to view and edit your
2879 data with a text editor.)
2881 OK, enough introduction. Let's take a look at some of the tools, and then
2882 we'll see how to hook them together in interesting ways. In the following
2883 discussion, we will only present those command line options that interest
2884 us. As you should always do, double check your system documentation
2887 @node The @code{who} command
2888 @unnumberedsec The @code{who} command
2890 The first program is the @code{who} command. By itself, it generates a
2891 list of the users who are currently logged in. Although I'm writing
2892 this on a single-user system, we'll pretend that several people are
2897 arnold   console Jan 22 19:57
2898 miriam   ttyp0   Jan 23 14:19(:0.0)
2899 bill     ttyp1   Jan 21 09:32(:0.0)
2900 arnold   ttyp2   Jan 23 20:48(:0.0)
2903 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
2904 There are three people logged in, and I am logged in twice. On traditional
2905 Unix systems, user names are never more than eight characters long. This
2906 little bit of trivia will be useful later. The output of @code{who} is nice,
2907 but the data is not all that exciting.
2909 @node The @code{cut} command
2910 @unnumberedsec The @code{cut} command
2912 The next program we'll look at is the @code{cut} command. This program
2913 cuts out columns or fields of input data. For example, we can tell it
2914 to print just the login name and full name from the @file{/etc/passwd}
2915 file. The @file{/etc/passwd} file has seven fields, separated by colons:
2919 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
2922 To get the first and fifth fields, we would use @code{cut} like this:
2925 $ cut -d: -f1,5 /etc/passwd
2928 arnold:Arnold D. Robbins
2929 miriam:Miriam A. Robbins
2933 With the @samp{-c} option, @code{cut} will cut out specific characters
2934 (i.e., columns) in the input lines. This command looks like it might be
2935 useful for data filtering.
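Both modes can be tried without touching the real @file{/etc/passwd}. In this sketch the records are held in a shell variable; the second record is made up so that there is more than one line to cut.

```shell
# Two passwd-style records; the second one is hypothetical.
records='arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
miriam:yxaay:112:10:Miriam A. Robbins:/home/miriam:/bin/sh'

# Field mode: colon-delimited fields 1 and 5.
names=$(printf '%s\n' "$records" | cut -d: -f1,5)

# Character mode: the first six characters of each line.
logins=$(printf '%s\n' "$records" | cut -c1-6)

printf '%s\n' "$names" "$logins"
```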
2938 @node The @code{sort} command
2939 @unnumberedsec The @code{sort} command
2941 Next we'll look at the @code{sort} command. This is one of the most
2942 powerful commands on a Unix-style system; one that you will often find
2943 yourself using when setting up fancy data plumbing. The @code{sort}
2944 command reads and sorts each file named on the command line. It then
2945 merges the sorted data and writes it to standard output. It will read
2946 standard input if no files are given on the command line (thus
2947 making it into a filter). The sort is based on the machine collating
2948 sequence (@sc{ASCII}) or based on user-supplied ordering criteria.
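A brief sketch of @code{sort} as a filter, as a merger of several files, and with one user-supplied criterion (the numeric option @samp{-n}). The data and file names are invented, and @code{LC_ALL=C} is set so that the collating sequence is plain @sc{ASCII}.

```shell
# As a filter: read standard input, write sorted lines.
plain=$(printf 'pear\napple\nfig\n' | LC_ALL=C sort)

# Several named files are sorted together into one output stream.
dir=$(mktemp -d)
printf 'b\nd\n' > "$dir/one"
printf 'c\na\n' > "$dir/two"
merged=$(LC_ALL=C sort "$dir/one" "$dir/two")
rm -r "$dir"

# Character order versus numeric order on the same input.
ascii=$(printf '10\n9\n2\n' | LC_ALL=C sort)
numeric=$(printf '10\n9\n2\n' | LC_ALL=C sort -n)

printf '%s\n' "$plain" "$merged" "$ascii" "$numeric"
```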
2951 @node The @code{uniq} command
2952 @unnumberedsec The @code{uniq} command
2954 Finally (at least for now), we'll look at the @code{uniq} program. When
2955 sorting data, you will often end up with duplicate lines, lines that
2956 are identical. Usually, all you need is one instance of each line.
2957 This is where @code{uniq} comes in. The @code{uniq} program reads its
2958 standard input, which it expects to be sorted. It only prints out one
2959 copy of each duplicated line. It does have several options. Later on,
2960 we'll use the @samp{-c} option, which prints each unique line, preceded
2961 by a count of the number of times that line occurred in the input.
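Here is a small sketch of both forms on invented data. The exact amount of padding in front of each count may vary between implementations, so the counts are best read as the first field of each line.

```shell
# uniq expects sorted input, so sort first.
sorted=$(printf 'b\na\nb\nc\na\nb\n' | LC_ALL=C sort)

once=$(printf '%s\n' "$sorted" | uniq)        # one copy of each line
counted=$(printf '%s\n' "$sorted" | uniq -c)  # each line preceded by its count

printf '%s\n' "$once" "$counted"
```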
2964 @node Putting the tools together
2965 @unnumberedsec Putting the tools together
2967 Now, let's suppose this is a large BBS system with dozens of users
2968 logged in. The management wants the SysOp to write a program that will
2969 generate a sorted list of logged in users. Furthermore, even if a user
2970 is logged in multiple times, his or her name should only show up in the
2973 The SysOp could sit down with the system documentation and write a C
2974 program that did this. It would take perhaps a couple of hundred lines
2975 of code and about two hours to write it, test it, and debug it.
2976 However, knowing the software toolbox, the SysOp can instead start out
2977 by generating just a list of logged on users:
2987 Next, sort the list:
2990 $ who | cut -c1-8 | sort
2997 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
3000 $ who | cut -c1-8 | sort | uniq
3006 The @code{sort} command actually has a @samp{-u} option that does what
3007 @code{uniq} does. However, @code{uniq} has other uses for which one
3008 cannot substitute @samp{sort -u}.
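Since the output of @code{who} depends on who happens to be logged in, the pipeline can be rehearsed on canned data. The sample below is an assumption modeled on the earlier listing, with the login name padded to eight columns.

```shell
# Canned who-style output (hypothetical); names occupy columns 1 through 8.
sample='arnold   console Jan 22 19:57
miriam   ttyp0   Jan 23 14:19
bill     ttyp1   Jan 21 09:32
arnold   ttyp2   Jan 23 20:48'

# The pipeline from the text, fed from the sample instead of who.
users=$(printf '%s\n' "$sample" | cut -c1-8 | LC_ALL=C sort | uniq)

# For this particular job, sort -u gives the same list.
same=$(printf '%s\n' "$sample" | cut -c1-8 | LC_ALL=C sort -u)

printf '%s\n' "$users"
```

Note that @code{cut -c1-8} keeps the padding, so short names carry trailing blanks.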
3010 The SysOp puts this pipeline into a shell script, and makes it available for
3011 all the users on the system:
3014 # cat > /usr/local/bin/listusers
3015 who | cut -c1-8 | sort | uniq
3017 # chmod +x /usr/local/bin/listusers
3020 There are four major points to note here. First, with just four
3021 programs, on one command line, the SysOp was able to save about two
3022 hours worth of work. Furthermore, the shell pipeline is just about as
3023 efficient as the C program would be, and it is much more efficient in
3024 terms of programmer time. People time is much more expensive than
3025 computer time, and in our modern ``there's never enough time to do
3026 everything'' society, saving two hours of programmer time is no mean
3029 Second, it is also important to emphasize that with the
3030 @emph{combination} of the tools, it is possible to do a special
3031 purpose job never imagined by the authors of the individual programs.
3033 Third, it is also valuable to build up your pipeline in stages, as we did here.
3034 This allows you to view the data at each stage in the pipeline, which helps
3035 you acquire the confidence that you are indeed using these tools correctly.
3037 Finally, by bundling the pipeline in a shell script, other users can use
3038 your command, without having to remember the fancy plumbing you set up for
3039 them. In terms of how you run them, shell scripts and compiled programs are
3042 After the previous warm-up exercise, we'll look at two additional, more
3043 complicated pipelines. For them, we need to introduce two more tools.
3045 The first is the @code{tr} command, which stands for ``transliterate.''
3046 The @code{tr} command works on a character-by-character basis, changing
3047 characters. Normally it is used for things like mapping upper case to
3051 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
3052 this example has mixed case!
3055 There are several options of interest:
3059 work on the complement of the listed characters, i.e.,
3060 operations apply to characters not in the given set
3063 delete characters in the first set from the output
3066 squeeze repeated characters in the output into just one character.
3069 We will be using all three options in a moment.
3071 The other command we'll look at is @code{comm}. The @code{comm}
3072 command takes two sorted input files as input data, and prints out the
3073 files' lines in three columns. The output columns are the data lines
3074 unique to the first file, the data lines unique to the second file, and
3075 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
3076 @samp{-3} command line options omit the respective columns. (This is
3077 non-intuitive and takes a little getting used to.) For example:
3099 The single dash as a filename tells @code{comm} to read standard input
3100 instead of a regular file.
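The column-suppressing options are easier to get used to with a tiny pair of files. This is a sketch with invented file names and contents; each command suppresses two columns so that exactly one remains.

```shell
dir=$(mktemp -d)
# Two small sorted files (hypothetical data).
printf 'apple\nbanana\ncherry\n' > "$dir/first"
printf 'banana\ncherry\ndate\n' > "$dir/second"

# Omit columns 2 and 3: lines unique to the first file.
only_first=$(LC_ALL=C comm -23 "$dir/first" "$dir/second")

# Omit columns 1 and 2: lines common to both files.
common=$(LC_ALL=C comm -12 "$dir/first" "$dir/second")

# A dash makes comm read the first input from standard input;
# omitting columns 1 and 3 leaves lines unique to the second file.
only_second=$(LC_ALL=C comm -13 - "$dir/second" < "$dir/first")

printf '%s\n' "$only_first" "$common" "$only_second"
rm -r "$dir"
```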
3102 Now we're ready to build a fancy pipeline. The first application is a word
3103 frequency counter. This helps an author determine if he or she is over-using
3106 The first step is to change the case of all the letters in our input file
3107 to one case. ``The'' and ``the'' are the same word when doing counting.
3110 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
3113 The next step is to get rid of punctuation. Quoted words and unquoted words
3114 should be treated identically; it's easiest to just get the punctuation out of
3118 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
3121 The second @code{tr} command operates on the complement of the listed
3122 characters, which are all the letters, the digits, the underscore, and
3123 the blank. The @samp{\012} represents the newline character; it has to
3124 be left alone. (The ASCII TAB character should also be included for
3125 good measure in a production script.)
3127 At this point, we have data consisting of words separated by blank space.
3128 The words only contain alphanumeric characters (and the underscore). The
3129 next step is to break the data apart so that we have one word per line. This
3130 makes the counting operation much easier, as we will see shortly.
3133 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3134 > tr -s '[ ]' '\012' | ...
3137 This command turns blanks into newlines. The @samp{-s} option squeezes
3138 multiple newline characters in the output into just one. This helps us
3139 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
3140 This is what the shell prints when it notices you haven't finished
3141 typing in all of a command.)
3143 We now have data consisting of one word per line, no punctuation, all one
3144 case. We're ready to count each word:
3147 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3148 > tr -s '[ ]' '\012' | sort | uniq -c | ...
3151 At this point, the data might look something like this:
3164 The output is sorted by word, not by count! What we want is the most
3165 frequently used words first. Fortunately, this is easy to accomplish,
3166 with the help of two more @code{sort} options:
3170 do a numeric sort, not an ASCII one
3173 reverse the order of the sort
3176 The final pipeline looks like this:
3179 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3180 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
3189 Whew! That's a lot to digest. Yet, the same principles apply. With six
3190 commands, on two lines (really one long one split for convenience), we've
3191 created a program that does something interesting and useful, in much
3192 less time than we could have written a C program to do the same thing.
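The whole pipeline can be exercised on a couple of lines of text instead of a file. In this sketch the sample text is invented, and @code{head -n 2} is added (it is not part of the original pipeline) to keep only the two most frequent words.

```shell
# Sample text (hypothetical), with mixed case and punctuation.
text='The cat and the dog.
"The" bird, and the fish!'

top=$(printf '%s\n' "$text" |
  tr '[A-Z]' '[a-z]' |
  tr -cd '[A-Za-z0-9_ \012]' |
  tr -s '[ ]' '\012' |
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr | head -n 2)

# Each line is a count followed by a word, most frequent first.
printf '%s\n' "$top"
```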
3194 A minor modification to the above pipeline can give us a simple spelling
3195 checker! To determine if you've spelled a word correctly, all you have to
3196 do is look it up in a dictionary. If it is not there, then chances are
3197 that your spelling is incorrect. So, we need a dictionary. If you
3198 have the Slackware Linux distribution, you have the file
3199 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400-word dictionary.
3202 Now, how to compare our file with the dictionary? As before, we generate
3203 a sorted list of words, one per line:
3206 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3207 > tr -s '[ ]' '\012' | sort -u | ...
3210 Now, all we need is a list of words that are @emph{not} in the
3211 dictionary. Here is where the @code{comm} command comes in.
3214 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3215 > tr -s '[ ]' '\012' | sort -u |
3216 > comm -23 - /usr/lib/ispell/ispell.words
3219 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3220 dictionary (the second file), and lines that are in both files. Lines
3221 only in the first file (standard input, our stream of words) are
3222 words that are not in the dictionary. These are likely candidates for
3223 spelling errors. This pipeline was the first cut at a production
3224 spelling checker on Unix.
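
The following sketch tries the same idea against a tiny, made-up dictionary, so that it does not depend on @file{/usr/lib/ispell/ispell.words} being installed. The word list, the sample sentence, the variable names, and the use of @code{mktemp} for a scratch file are all assumptions for illustration. Since @code{comm} expects both inputs to be sorted the same way, the sketch sets @code{LC_ALL=C} for the sorts and for @code{comm}.

```shell
# A tiny stand-in dictionary (made up), sorted in the C locale.
dict=$(mktemp)
printf 'free\nis\nsoftware\n' | LC_ALL=C sort > "$dict"

# Words from the sample text that the dictionary lacks.
# "sofware" is a deliberate misspelling.
suspects=$(printf 'Free sofware is free.\n' |
    tr '[A-Z]' '[a-z]' | tr -cd '[A-Za-z0-9_ \012]' |
    tr -s '[ ]' '\012' | LC_ALL=C sort -u |
    LC_ALL=C comm -23 - "$dict")

rm -f "$dict"
printf '%s\n' "$suspects"
```

Only the misspelled word survives, because every other word appears in both inputs and is suppressed by @samp{-3}.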
3226 There are some other tools that deserve brief mention.
3230 @code{grep}: search files for text that matches a regular expression
3233 @code{egrep}: like @code{grep}, but with more powerful regular expressions
3236 @code{wc}: count lines, words, characters
3239 @code{tee}: a T-fitting for data pipes, copies data to files and to standard output
3242 @code{sed}: the stream editor, an advanced tool
3245 @code{awk}: a data manipulation language, another advanced tool
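
A few of these tools fit together in the same way as the earlier pipelines. The sketch below, which uses made-up input and variable names, has @code{grep} select lines, @code{tee} save a copy of them to a scratch file, and @code{wc} count them. The scratch file from @code{mktemp} is an assumption for the example.

```shell
# Made-up input: three lines, two of which mention GNU.
lines='GNU is not Unix
Unix is a trademark
GNU tools are free'

# grep selects the matching lines; tee saves a copy of them in a
# file and passes them along; wc -l counts what comes through.
copy=$(mktemp)
count=$(printf '%s\n' "$lines" | grep 'GNU' | tee "$copy" | wc -l)

# Some wc implementations pad the number with blanks; strip them.
count=$(printf '%s\n' "$count" | tr -d ' ')
saved=$(cat "$copy")
rm -f "$copy"

printf '%s matching lines\n' "$count"
```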
3248 The software tools philosophy also espoused the following bit of
3249 advice: ``Let someone else do the hard part.'' This means, take
3250 something that gives you most of what you need, and then massage it the
3251 rest of the way until it's in the form that you want.
3257 Each program should do one thing well. No more, no less.
3260 Combining programs with appropriate plumbing leads to results where
3261 the whole is greater than the sum of the parts. It also leads to novel
3262 uses of programs that the authors might never have imagined.
3265 Programs should never print extraneous header or trailer data, since these
3266 could get sent on down a pipeline. (A point we didn't mention earlier.)
3269 Let someone else do the hard part.
3272 Know your toolbox! Use each program appropriately. If you don't have an
3273 appropriate tool, build one.
3276 As of this writing, all the programs we've discussed are available via
3277 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
3278 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
3279 current when this column was written. Check the nearest GNU archive for
3280 the current version.}
3282 None of what I have presented in this column is new. The Software Tools
3283 philosophy was first introduced in the book @cite{Software Tools},
3284 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
3285 0-201-03669-X). This book showed how to write and use software
3286 tools. It was written in 1976, using a preprocessor for FORTRAN named
3287 @code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
3288 as it is now; FORTRAN was. The last chapter presented a @code{ratfor}
3289 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
3290 awful lot like C; if you know C, you won't have any problem following it.
3293 In 1981, the book was updated and made available as @cite{Software
3294 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
3295 remain in print, and are well worth reading if you're a programmer.
3296 They certainly made a major change in how I view programming.
3298 Initially, the programs in both books were available (on 9-track tape)
3299 from Addison-Wesley. Unfortunately, this is no longer the case,
3300 although you might be able to find copies floating around the Internet.
3301 For a number of years, there was an active Software Tools Users Group,
3302 whose members had ported the original @code{ratfor} programs to essentially
3303 every computer system with a FORTRAN compiler. The popularity of the
3304 group waned in the middle '80s as Unix began to spread beyond universities.
3306 With the current proliferation of GNU code and other clones of Unix programs,
3307 these programs now receive little attention; modern C versions are
3308 much more efficient and do more than these programs do. Nevertheless, as
3309 exposition of good programming style, and evangelism for a still-valuable
3310 philosophy, these books are unparalleled, and I recommend them highly.
3312 Acknowledgment: I would like to express my gratitude to Brian Kernighan
3313 of Bell Labs, the original Software Toolsmith, for reviewing this column.
3325 @c texinfo-column-for-description: 32