3 @setfilename textutils.info
4 @settitle GNU text utilities
12 @c Put everything in one index (arbitrarily chosen to be the concept index).
20 @set Francois Franc,ois
23 @set Francois Fran\noexpand\ptexc cois
29 * Text utilities: (textutils). GNU text utilities.
30 * cat: (textutils)cat invocation. Concatenate and write files.
31 * cksum: (textutils)cksum invocation. Print POSIX CRC checksum.
32 * comm: (textutils)comm invocation. Compare sorted files by line.
33 * csplit: (textutils)csplit invocation. Split by context.
34 * cut: (textutils)cut invocation. Print selected parts of lines.
35 * expand: (textutils)expand invocation. Convert tabs to spaces.
36 * fmt: (textutils)fmt invocation. Reformat paragraph text.
37 * fold: (textutils)fold invocation. Wrap long input lines.
38 * head: (textutils)head invocation. Output the first part of files.
39 * join: (textutils)join invocation. Join lines on a common field.
40 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
41 * nl: (textutils)nl invocation. Number lines and write files.
42 * od: (textutils)od invocation. Dump files in octal, etc.
43 * paste: (textutils)paste invocation. Merge lines of files.
44 * pr: (textutils)pr invocation. Paginate or columnate files.
45 * sort: (textutils)sort invocation. Sort text files.
46 * split: (textutils)split invocation. Split into fixed-size pieces.
47 * sum: (textutils)sum invocation. Print traditional checksum.
48 * tac: (textutils)tac invocation. Reverse files.
49 * tail: (textutils)tail invocation. Output the last part of files.
50 * tr: (textutils)tr invocation. Translate characters.
51 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
52 * uniq: (textutils)uniq invocation. Uniqify files.
53 * wc: (textutils)wc invocation. Byte, word, and line counts.
59 This file documents the GNU text utilities.
61 Copyright (C) 1994, 95 Free Software Foundation, Inc.
63 Permission is granted to make and distribute verbatim copies of
64 this manual provided the copyright notice and this permission notice
65 are preserved on all copies.
68 Permission is granted to process this file through TeX and print the
69 results, provided the printed document carries copying permission
70 notice identical to this one except for the removal of this paragraph
71 (this paragraph not being relevant to the printed manual).
74 Permission is granted to copy and distribute modified versions of this
75 manual under the conditions for verbatim copying, provided that the entire
76 resulting derived work is distributed under the terms of a permission
77 notice identical to this one.
79 Permission is granted to copy and distribute translations of this manual
80 into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved by the Foundation.
86 @title GNU @code{textutils}
87 @subtitle A set of text utilities
88 @subtitle for version @value{VERSION}, @value{RELEASEDATE}
89 @author David MacKenzie et al.
92 @vskip 0pt plus 1filll
93 Copyright @copyright{} 1994, 95 Free Software Foundation, Inc.
95 Permission is granted to make and distribute verbatim copies of
96 this manual provided the copyright notice and this permission notice
97 are preserved on all copies.
99 Permission is granted to copy and distribute modified versions of this
100 manual under the conditions for verbatim copying, provided that the entire
101 resulting derived work is distributed under the terms of a permission
102 notice identical to this one.
104 Permission is granted to copy and distribute translations of this manual
105 into another language, under the above conditions for modified versions,
except that this permission notice may be stated in a translation approved by the Foundation.
113 @top GNU text utilities
115 @cindex text utilities
116 @cindex utilities for text handling
This manual minimally documents version @value{VERSION} of the GNU text utilities.
122 * Introduction:: Caveats, overview, and authors.
123 * Common options:: Common options.
124 * Output of entire files:: cat tac nl od
125 * Formatting file contents:: fmt pr fold
126 * Output of parts of files:: head tail split csplit
127 * Summarizing files:: wc sum cksum md5sum
128 * Operating on sorted files:: sort uniq comm
129 * Operating on fields within a line:: cut paste join
130 * Operating on characters:: tr expand unexpand
131 * Opening the software toolbox:: The software tools philosophy.
132 * Index:: General index.
138 @chapter Introduction
142 This manual is incomplete: No attempt is made to explain basic concepts
143 in a way suitable for novices. Thus, if you are interested, please get
involved in improving this manual. The entire GNU community will benefit.
148 The GNU text utilities are mostly compatible with the POSIX.2 standard.
150 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
151 @c sh-utils.texi too -- so be sure to keep them consistent.
152 @cindex bugs, reporting
153 Please report bugs to @samp{bug-gnu-utils@@prep.ai.mit.edu}. Remember
154 to include the version number, machine architecture, input files, and
155 any other information needed to reproduce the bug: your input, what you
156 expected, what you got, and why it is wrong. Diffs are welcome, but
157 please include a description of the problem as well, since this is
158 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
160 This manual is based on the Unix man pages in the distribution, which
161 were originally written by David MacKenzie and updated by Jim Meyering.
162 The original @code{fmt} man page was written by Ross Paterson.
163 @value{Francois} Pinard did the initial conversion to Texinfo format.
164 Karl Berry did the indexing, some reorganization, and editing of the results.
Richard Stallman contributed his usual invaluable insights to the overall process.
170 @chapter Common options
172 @cindex common options
Certain options are available in all these programs. Rather than
repeating identical descriptions for each program, this manual describes
them here. (In fact, every GNU program accepts (or should accept) these options.)
179 A few of these programs take arbitrary strings as arguments. In those
180 cases, @samp{--help} and @samp{--version} are taken as these options
only if there is exactly one command line argument.
188 Print a usage message listing all available options, then exit successfully.
192 @cindex version number, finding
193 Print the version number, then exit successfully.
198 @node Output of entire files
199 @chapter Output of entire files
201 @cindex output of entire files
202 @cindex entire files, output of
These commands read and write entire files, possibly transforming them in some way.
208 * cat invocation:: Concatenate and write files.
209 * tac invocation:: Concatenate and write files in reverse.
210 * nl invocation:: Number lines and write files.
211 * od invocation:: Write files in octal or other formats.
215 @section @code{cat}: Concatenate and write files
218 @cindex concatenate and write files
219 @cindex copying files
221 @code{cat} copies each @var{file} (@samp{-} means standard input), or
222 standard input if none are given, to standard output. Synopsis:
cat [@var{option}]@dots{} [@var{file}]@dots{}
228 The program accepts the following options. Also see @ref{Common options}.
236 Equivalent to @samp{-vET}.
239 @itemx --number-nonblank
241 @opindex --number-nonblank
242 Number all nonblank output lines, starting with 1.
246 Equivalent to @samp{-vE}.
252 Display a @samp{$} after the end of each line.
258 Number all output lines, starting with 1.
261 @itemx --squeeze-blank
263 @opindex --squeeze-blank
264 @cindex squeezing blank lines
265 Replace multiple adjacent blank lines with a single blank line.
269 Equivalent to @samp{-vT}.
275 Display @key{TAB} characters as @samp{^I}.
279 Ignored; for Unix compatibility.
282 @itemx --show-nonprinting
284 @opindex --show-nonprinting
285 Display control characters except for @key{LFD} and @key{TAB} using
@samp{^} notation and precede characters that have the high bit set with @samp{M-}.
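
As a brief illustration of the display options, the following sketch pipes a small sample through @code{cat}. The sample text is invented for the example, and a shell with @code{printf} is assumed.

```shell
# Sample input (made up for this sketch): a tab, then two blank lines.
printf 'a\tb\n\n\nc\n' | cat -A
# a^Ib$
# $
# $
# c$

# -s squeezes the two adjacent blank lines into one,
# and -n then numbers the three lines that remain.
printf 'a\tb\n\n\nc\n' | cat -s -n
```

The first command makes the tab and the line ends visible, which is often the quickest way to find stray whitespace in a file.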
293 @section @code{tac}: Concatenate and write files in reverse
296 @cindex reversing files
298 @code{tac} copies each @var{file} (@samp{-} means standard input), or
299 standard input if none are given, to standard output, reversing the
300 records (lines by default) in each separately. Synopsis:
303 tac [@var{option}]@dots{} [@var{file}]@dots{}
306 @dfn{Records} are separated by instances of a string (newline by
307 default). By default, this separator string is attached to the end of
308 the record that it follows in the file.
310 The program accepts the following options. Also see @ref{Common options}.
318 The separator is attached to the beginning of the record that it
319 precedes in the file.
325 Treat the separator string as a regular expression.
327 @item -s @var{separator}
328 @itemx --separator=@var{separator}
331 Use @var{separator} as the record separator, instead of newline.
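
The record and separator rules above can be tried on a few short inputs. This is only a sketch: the sample strings are invented, and a shell with @code{printf} is assumed.

```shell
# Reverse line by line (newline is the default separator).
printf '1\n2\n3\n' | tac
# 3
# 2
# 1

# Use a comma as the separator; it stays attached to the end of each record.
printf 'a,b,c,' | tac -s ,
# c,b,a,

# With -b the separator is attached to the beginning of each record instead.
printf ',a,b,c' | tac -b -s ,
```

Note that in the second command the input ends with the separator, so every record, including the last one, carries its own comma.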
337 @section @code{nl}: Number lines and write files
340 @cindex numbering lines
341 @cindex line numbering
343 @code{nl} writes each @var{file} (@samp{-} means standard input), or
344 standard input if none are given, to standard output, with line numbers
345 added to some or all of the lines. Synopsis:
348 nl [@var{option}]@dots{} [@var{file}]@dots{}
351 @cindex logical pages, numbering on
352 @code{nl} decomposes its input into (logical) pages; by default, the
353 line number is reset to 1 at the top of each logical page. @code{nl}
354 treats all of the input files as a single document; it does not reset
355 line numbers or logical pages between files.
357 @cindex headers, numbering
358 @cindex body, numbering
359 @cindex footers, numbering
360 A logical page consists of three sections: header, body, and footer.
361 Any of the sections can be empty. Each can be numbered in a different
362 style from the others.
364 The beginnings of the sections of logical pages are indicated in the
365 input file by a line containing exactly one of these delimiter strings:
376 The two characters from which these strings are made can be changed from
377 @samp{\} and @samp{:} via options (see below), but the pattern and
378 length of each string cannot be changed.
380 A section delimiter is replaced by an empty line on output. Any text
381 that comes before the first section delimiter string in the input file
382 is considered to be part of a body section, so @code{nl} treats a
383 file that contains no section delimiters as a single body section.
385 The program accepts the following options. Also see @ref{Common options}.
390 @itemx --body-numbering=@var{style}
392 @opindex --body-numbering
393 Select the numbering style for lines in the body section of each
394 logical page. When a line is not numbered, the current line number
395 is not incremented, but the line number separator character is still
396 prepended to the line. The styles are:
402 number only nonempty lines (default for body),
404 do not number lines (default for header and footer),
406 number only lines that contain a match for @var{regexp}.
410 @itemx --section-delimiter=@var{cd}
412 @opindex --section-delimiter
413 @cindex section delimiters of pages
414 Set the section delimiter characters to @var{cd}; default is
415 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
416 (Remember to protect @samp{\} or other metacharacters from shell
417 expansion with quotes or extra backslashes.)
420 @itemx --footer-numbering=@var{style}
422 @opindex --footer-numbering
423 Analogous to @samp{--body-numbering}.
426 @itemx --header-numbering=@var{style}
428 @opindex --header-numbering
429 Analogous to @samp{--body-numbering}.
431 @item -i @var{number}
432 @itemx --page-increment=@var{number}
434 @opindex --page-increment
435 Increment line numbers by @var{number} (default 1).
437 @item -l @var{number}
438 @itemx --join-blank-lines=@var{number}
440 @opindex --join-blank-lines
441 @cindex empty lines, numbering
442 @cindex blank lines, numbering
443 Consider @var{number} (default 1) consecutive empty lines to be one
444 logical line for numbering, and only number the last one. Where fewer
445 than @var{number} consecutive empty lines occur, do not number them.
An empty line is one that contains no characters, not even spaces or tabs.
449 @item -n @var{format}
450 @itemx --number-format=@var{format}
452 @opindex --number-format
453 Select the line numbering format (default is @code{rn}):
457 @opindex ln @r{format for @code{nl}}
458 left justified, no leading zeros;
460 @opindex rn @r{format for @code{nl}}
461 right justified, no leading zeros;
463 @opindex rz @r{format for @code{nl}}
464 right justified, leading zeros.
470 @opindex --no-renumber
471 Do not reset the line number at the start of a logical page.
473 @item -s @var{string}
474 @itemx --number-separator=@var{string}
476 @opindex --number-separator
477 Separate the line number from the text line in the output with
478 @var{string} (default is @key{TAB}).
480 @item -v @var{number}
481 @itemx --first-page=@var{number}
483 @opindex --first-page
484 Set the initial line number on each logical page to @var{number} (default 1).
486 @item -w @var{number}
487 @itemx --number-width=@var{number}
489 @opindex --number-width
490 Use @var{number} characters for line numbers (default 6).
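
The following sketch shows logical pages and a few of the numbering options in use. The sample document is invented for the example, and a shell with @code{printf} is assumed.

```shell
# A made-up document with one logical page: header, body, footer.
# In the printf format, '\\:' yields the two characters \: in the output.
printf '\\:\\:\\:\nTitle\n\\:\\:\nfirst\nsecond\n\\:\nEnd\n' | nl
# Only "first" and "second" are numbered (1 and 2) by default; the
# header and footer lines are left unnumbered, and each delimiter
# line comes out as an empty line.

# Number every line, three columns wide, with ": " after the number.
printf 'x\n\ny\n' | nl -b a -w 3 -s ': '
#   1: x
#   2: 
#   3: y
```

Remember that the delimiter strings contain a backslash, so they usually need quoting when typed at a shell prompt.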
496 @section @code{od}: Write files in octal or other formats
499 @cindex octal dump of files
500 @cindex hex dump of files
501 @cindex ASCII dump of files
502 @cindex file contents, dumping unambiguously
504 @code{od} writes an unambiguous representation of each @var{file}
505 (@samp{-} means standard input), or standard input if none are given.
509 od [@var{option}]@dots{} [@var{file}]@dots{}
510 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
513 Each line of output consists of the offset in the input, followed by
514 groups of data from the file. By default, @code{od} prints the offset in
octal, and each group of file data is two bytes of input printed as a single octal number.
518 The program accepts the following options. Also see @ref{Common options}.
523 @itemx --address-radix=@var{radix}
525 @opindex --address-radix
526 @cindex radix for file offsets
527 @cindex file offset radix
528 Select the base in which file offsets are printed. @var{radix} can
529 be one of the following:
539 none (do not print offsets).
542 The default is octal.
545 @itemx --skip-bytes=@var{bytes}
547 @opindex --skip-bytes
548 Skip @var{bytes} input bytes before formatting and writing. If
549 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
550 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
551 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
552 by 1024, and @samp{m} by 1048576.
555 @itemx --read-bytes=@var{bytes}
557 @opindex --read-bytes
558 Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
@var{bytes} are interpreted as for the @samp{-j} option.
562 @itemx --strings[=@var{n}]
565 @cindex string constants, outputting
566 Instead of the normal output, output only @dfn{string constants}: at
567 least @var{n} (3 by default) consecutive ASCII graphic characters,
568 followed by a null (zero) byte.
571 @itemx --format=@var{type}
574 Select the format in which to output the file data. @var{type} is a
575 string of one or more of the below type indicator characters. If you
576 include more than one type indicator character in a single @var{type}
577 string, or use this option more than once, @code{od} writes one copy
578 of each output line using each of the data types that you specified,
579 in the order that you specified.
585 ASCII character or backslash escape,
598 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
599 newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
@samp{ }, @samp{\n}, and @samp{\0}, respectively.
603 Except for types @samp{a} and @samp{c}, you can specify the number
604 of bytes to use in interpreting each number in the given data type
605 by following the type indicator character with a decimal integer.
Alternatively, you can specify the size of one of the C compiler's
built-in data types by following the type indicator character with
one of the following characters. For integers (@samp{d}, @samp{o}, @samp{u}, @samp{x}):
622 For floating point (@code{f}):
634 @itemx --output-duplicates
636 @opindex --output-duplicates
637 Output consecutive lines that are identical. By default, when two or
638 more consecutive output lines would be identical, @code{od} outputs only
639 the first line, and puts just an asterisk on the following line to
640 indicate the elision.
643 @itemx --width[=@var{n}]
Dump @var{n} input bytes per output line. This must be a multiple of
647 the least common multiple of the sizes associated with the specified
648 output types. If @var{n} is omitted, the default is 32. If this option
649 is not given at all, the default is 16.
653 The next several options map the old, pre-POSIX format specification
654 options to the corresponding POSIX format specs. GNU @code{od} accepts
any combination of old- and new-style options. Format specification options accumulate.
662 Output as named characters. Equivalent to @samp{-ta}.
666 Output as octal bytes. Equivalent to @samp{-toC}.
Output as ASCII characters or backslash escapes. Equivalent to @samp{-tc}.
675 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
679 Output as floats. Equivalent to @samp{-tfF}.
683 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
687 Output as decimal shorts. Equivalent to @samp{-td2}.
691 Output as decimal longs. Equivalent to @samp{-td4}.
695 Output as octal shorts. Equivalent to @samp{-to2}.
699 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
703 @opindex --traditional
704 Recognize the pre-POSIX non-option arguments that traditional @code{od}
705 accepted. The following syntax:
708 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
712 can be used to specify at most one file and optional arguments
713 specifying an offset and a pseudo-start address, @var{label}. By
714 default, @var{offset} is interpreted as an octal number specifying how
715 many input bytes to skip before formatting and writing. The optional
716 trailing decimal point forces the interpretation of @var{offset} as a
717 decimal number. If no decimal is specified and the offset begins with
718 @samp{0x} or @samp{0X} it is interpreted as a hexadecimal number. If
719 there is a trailing @samp{b}, the number of bytes skipped will be
720 @var{offset} multiplied by 512. The @var{label} argument is interpreted
721 just like @var{offset}, but it specifies an initial pseudo-address. The
pseudo-addresses are displayed in parentheses following any normal address.
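
A few short dumps may make the new-style format options easier to follow. This is a sketch only: the input bytes are invented, a shell with @code{printf} is assumed, and the amount of padding in the output may differ between versions.

```shell
# Hexadecimal bytes, with the offset column suppressed.
printf 'AB\n' | od -A n -t x1
#  41 42 0a

# Named characters (type a).
printf ' \n' | od -A n -t a
#   sp  nl

# Skip two input bytes, then dump at most one.
printf 'ABCD' | od -A n -t x1 -j 2 -N 1
#  43
```

Suppressing the offsets with @samp{-A n} keeps the output short, which is convenient when only the data bytes are of interest.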
728 @node Formatting file contents
729 @chapter Formatting file contents
731 @cindex formatting file contents
733 These commands reformat the contents of files.
736 * fmt invocation:: Reformat paragraph text.
737 * pr invocation:: Paginate or columnate files for printing.
738 * fold invocation:: Wrap input lines to fit in specified width.
743 @section @code{fmt}: Reformat paragraph text
746 @cindex reformatting paragraph text
747 @cindex paragraphs, reformatting
748 @cindex text, reformatting
750 @code{fmt} fills and joins lines to produce output lines of (at most)
751 a given number of characters (75 by default). Synopsis:
754 fmt [@var{option}]@dots{} [@var{file}]@dots{}
757 @code{fmt} reads from the specified @var{file} arguments (or standard
758 input if none are given), and writes to standard output.
760 By default, blank lines, spaces between words, and indentation are
761 preserved in the output; successive input lines with different
indentation are not joined; tabs are expanded on input and introduced on output.
765 @cindex line-breaking
766 @cindex sentences and line-breaking
767 @cindex Knuth, Donald E.
768 @cindex Plass, Michael F.
769 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
770 avoid line breaks after the first word of a sentence or before the last
771 word of a sentence. A @dfn{sentence break} is defined as either the end
772 of a paragraph or a word ending in any of @samp{.?!}, followed by two
773 spaces or end of line, ignoring any intervening parentheses or quotes.
774 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
775 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
776 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
777 and Experience}, 11 (1981), 1119--1184).
779 The program accepts the following options. Also see @ref{Common options}.
784 @itemx --crown-margin
786 @opindex --crown-margin
788 @dfn{Crown margin} mode: preserve the indentation of the first two
789 lines within a paragraph, and align the left margin of each subsequent
790 line with that of the second line.
793 @itemx --tagged-paragraph
795 @opindex --tagged-paragraph
796 @cindex tagged paragraphs
797 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
798 indentation of the first line of a paragraph is the same as the
indentation of the second, the first line is treated as a one-line paragraph.
805 @opindex --split-only
806 Split lines only. Do not join short lines to form longer ones. This
807 prevents sample lines of code, and other such ``formatted'' text from
808 being unduly combined.
811 @itemx --uniform-spacing
813 @opindex --uniform-spacing
814 Uniform spacing. Reduce spacing between words to one space, and spacing
815 between sentences to two spaces.
818 @itemx -w @var{width}
819 @itemx --width=@var{width}
820 @opindex -@var{width}
823 Fill output lines up to @var{width} characters (default 75). @code{fmt}
824 initially tries to make lines about 7% shorter than this, to give it
825 room to balance line lengths.
827 @item -p @var{prefix}
828 @itemx --prefix=@var{prefix}
829 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
are subject to formatting. The prefix and any preceding whitespace are
831 stripped for the formatting and then re-attached to each formatted output
832 line. One use is to format certain kinds of program comments, while
833 leaving the code unchanged.
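
The difference between filling, @samp{-s}, and @samp{-p} can be seen on very short inputs. The sample text below is invented for this sketch, and a shell with @code{printf} is assumed.

```shell
# Two short lines with the same indentation are joined.
printf 'a b\nc d\n' | fmt
# a b c d

# With -s, short lines are never joined.
printf 'a b\nc d\n' | fmt -s
# a b
# c d

# With -p, only lines that start with the prefix are refilled;
# the line of code is passed through untouched.
printf '# a b\n# c d\nx = 1\n' | fmt -p '#'
```

The last form is the one suggested above for reformatting program comments while leaving the surrounding code alone.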
839 @section @code{pr}: Paginate or columnate files for printing
842 @cindex printing, preparing files for
843 @cindex multicolumn output, generating
845 @code{pr} writes each @var{file} (@samp{-} means standard input), or
846 standard input if none are given, to standard output, paginating and
847 optionally outputting in multicolumn format. Synopsis:
850 pr [@var{option}]@dots{} [@var{file}]@dots{}
853 By default, a 5-line header is printed: two blank lines; a line with the
854 date, the file name, and the page count; and two more blank lines. A
five-line footer (entirely blank) is also printed.
857 Form feeds in the input cause page breaks in the output.
859 The program accepts the following options. Also see @ref{Common options}.
864 Begin printing with page @var{page}.
867 @opindex -@var{column}
868 Produce @var{column}-column output and print columns down. The column
869 width is automatically decreased as @var{column} increases; unless you
870 use the @samp{-w} option to increase the page width as well, this option
871 might well cause some input to be truncated.
875 @cindex across columns
876 Print columns across rather than down.
880 @cindex balancing columns
881 Balance columns on the last page.
885 Print control characters using hat notation (e.g., @samp{^G}); print
886 other unprintable characters in octal backslash notation. By default,
887 unprintable characters are not changed.
891 @cindex double spacing
892 Double space the output.
894 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
897 Expand tabs to spaces on input. Optional argument @var{in-tabchar} is
898 the input tab character (default is @key{TAB}). Second optional
argument @var{in-tabwidth} is the input tab character's width (default is 8).
906 Use a formfeed instead of newlines to separate output pages.
908 @item -h @var{header}
910 Replace the file name in the header with the string @var{header}.
912 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
915 Replace spaces with tabs on output. Optional argument @var{out-tabchar}
916 is the output tab character (default is @key{TAB}). Second optional
argument @var{out-tabwidth} is the output tab character's width (default is 8).
922 Set the page length to @var{n} (default 66) lines. If @var{n} is less
than 10, the headers and footers are omitted, as if the @samp{-t} option had been given.
928 Print all files in parallel, one in each column.
930 @item -n[@var{number-separator}[@var{digits}]]
932 Precede each column with a line number; with parallel files (@samp{-m}),
933 precede each line with a line number. Optional argument
934 @var{number-separator} is the character to print after each number
935 (default is @key{TAB}). Optional argument @var{digits} is the number of
936 digits per line number (default is 5).
940 @cindex indenting lines
Indent each line with a margin @var{n} (default is zero) spaces wide, i.e., set
the left margin. The total page width is @var{n} plus the width set
944 with the @samp{-w} option.
948 Do not print a warning message when an argument @var{file} cannot be
949 opened. (The exit status will still be nonzero, however.)
953 Separate columns by the single character @var{c}. If @var{c} is
954 omitted, the default is space; if this option is omitted altogether, the
955 default is @key{TAB}.
959 Do not print the usual 5-line header and the 5-line footer on each page,
and do not fill out the bottoms of pages (with blank lines or formfeeds).
965 Print unprintable characters in octal backslash notation.
969 Set the page width to @var{n} (default is 72) columns.
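
As a small illustration of multicolumn output, the sketch below arranges six invented input lines in three columns. A shell with @code{printf} is assumed, and the exact spacing between columns depends on the page width.

```shell
# Three columns, filled across (-a); -t drops the header and footer,
# and -w 30 narrows the page so the columns sit close together.
printf '1\n2\n3\n4\n5\n6\n' | pr -t -a -3 -w 30
# The first output line holds 1, 2 and 3; the second holds 4, 5 and 6.
```

Without @samp{-a} the same input is laid out down the columns rather than across them.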
974 @node fold invocation
975 @section @code{fold}: Wrap input lines to fit in specified width
978 @cindex wrapping long input lines
979 @cindex folding long input lines
981 @code{fold} writes each @var{file} (@samp{-} means standard input), or
982 standard input if none are given, to standard output, breaking long
986 fold [@var{option}]@dots{} [@var{file}]@dots{}
989 By default, @code{fold} breaks lines wider than 80 columns. The output
990 is split into as many lines as necessary.
992 @cindex screen columns
993 @code{fold} counts screen columns by default; thus, a tab may count more
994 than one column, backspace decreases the column count, and carriage
995 return sets the column to zero.
997 The program accepts the following options. Also see @ref{Common options}.
1005 Count bytes rather than columns, so that tabs, backspaces, and carriage
returns are each counted as taking up one column, just like other characters.
1013 Break at word boundaries: the line is broken after the last blank before
1014 the maximum line length. If the line contains no such blanks, the line
1015 is broken at the maximum line length as usual.
1017 @item -w @var{width}
1018 @itemx --width=@var{width}
1021 Use a maximum line length of @var{width} columns instead of 80.
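
The two breaking rules can be compared on short inputs. The sample strings are invented for this sketch, and a shell with @code{printf} is assumed.

```shell
# Break strictly at the maximum width of 4 columns.
printf 'abcdefghij\n' | fold -w 4
# abcd
# efgh
# ij

# With -s, break after the last blank that fits within 8 columns.
# The first output line keeps its trailing blank.
printf 'aaa bbb ccc\n' | fold -s -w 8
# aaa bbb 
# ccc
```

Because @samp{-s} breaks after the blank, lines folded this way end in a space; a later filter can remove it if that matters.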
1026 @node Output of parts of files
1027 @chapter Output of parts of files
1029 @cindex output of parts of files
1030 @cindex parts of files, output of
1032 These commands output pieces of the input.
1035 * head invocation:: Output the first part of files.
1036 * tail invocation:: Output the last part of files.
1037 * split invocation:: Split a file into fixed-size pieces.
1038 * csplit invocation:: Split a file into context-determined pieces.
1041 @node head invocation
1042 @section @code{head}: Output the first part of files
1045 @cindex initial part of files, outputting
1046 @cindex first part of files, outputting
1048 @code{head} prints the first part (10 lines by default) of each
1049 @var{file}; it reads from standard input if no files are given or
1050 when given a @var{file} of @samp{-}. Synopses:
1053 head [@var{option}]@dots{} [@var{file}]@dots{}
1054 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
If more than one @var{file} is specified, @code{head} prints a
1058 one-line header consisting of
1060 ==> @var{file name} <==
1063 before the output for each @var{file}.
1065 @code{head} accepts two option formats: the new one, in which numbers
1066 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1067 the number precedes any option letters (@samp{-1q}).
1069 The program accepts the following options. Also see @ref{Common options}.
1073 @item -@var{count}@var{options}
1074 @opindex -@var{count}
1075 This option is only recognized if it is specified first. @var{count} is
1076 a decimal number optionally followed by a size letter (@samp{b},
1077 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1078 or other option letters (@samp{cqv}).
1080 @item -c @var{bytes}
1081 @itemx --bytes=@var{bytes}
1084 Print the first @var{bytes} bytes, instead of initial lines. Appending
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m} by 1048576.
1089 @itemx --lines=@var{n}
1092 Output the first @var{n} lines.
1100 Never print file name headers.
1106 Always print file name headers.
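
The new-style options and the file name headers can be tried as follows. This is a sketch: the file names @file{a} and @file{b} and their contents are invented, and a shell with @code{printf} and @code{mktemp} is assumed.

```shell
# First two lines of standard input.
printf '1\n2\n3\n' | head -n 2
# 1
# 2

# First three bytes instead of lines.
printf 'abcdef' | head -c 3
# abc

# With more than one file, each part is preceded by a header.
dir=$(mktemp -d)
printf 'x\n' > "$dir/a"
printf 'y\n' > "$dir/b"
( cd "$dir" && head -n 1 a b )
# ==> a <==
# x
#
# ==> b <==
# y
rm -r "$dir"
```

Adding @samp{-q} to the last command suppresses the headers, leaving only the two lines of data.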
1111 @node tail invocation
1112 @section @code{tail}: Output the last part of files
1115 @cindex last part of files, outputting
1117 @code{tail} prints the last part (10 lines by default) of each
1118 @var{file}; it reads from standard input if no files are given or
1119 when given a @var{file} of @samp{-}. Synopses:
1122 tail [@var{option}]@dots{} [@var{file}]@dots{}
1123 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1124 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1127 If more than one @var{file} is specified, @code{tail} prints a
1128 one-line header consisting of
1130 ==> @var{file name} <==
1133 before the output for each @var{file}.
1135 @cindex BSD @code{tail}
1136 GNU @code{tail} can output any amount of data (some other versions of
1137 @code{tail} cannot). It also has no @samp{-r} option (print in
1138 reverse), since reversing a file is really a different job from printing
1139 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1140 only reverse files that are at most as large as its buffer, which is
1141 typically 32k. A more reliable and versatile way to reverse files is
1142 the GNU @code{tac} command.
1144 @code{tail} accepts two option formats: the new one, in which numbers
1145 are arguments to the options (@samp{-n 1}), and the old one, in which
1146 the number precedes any option letters (@samp{-1} or @samp{+1}).
1148 If any option-argument is a number @var{n} starting with a @samp{+},
1149 @code{tail} begins printing with the @var{n}th item from the start of
1150 each file, instead of from the end.
1152 The program accepts the following options. Also see @ref{Common options}.
1158 @opindex -@var{count}
1159 @opindex +@var{count}
1160 This option is only recognized if it is specified first. @var{count} is
1161 a decimal number optionally followed by a size letter (@samp{b},
1162 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1163 or other option letters (@samp{cfqv}).
1165 @item -c @var{bytes}
1166 @itemx --bytes=@var{bytes}
1169 Output the last @var{bytes} bytes, instead of final lines. Appending
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m} by 1048576.
1177 @cindex growing files
1178 Loop forever trying to read more characters at the end of the file,
1179 presumably because the file is growing. Ignored if reading from a pipe.
1180 If more than one file is given, @code{tail} prints a header whenever it
1181 gets output from a different file, to indicate which file that output is
1185 @itemx --lines=@var{n}
1188 Output the last @var{n} lines.
1196 Never print file name headers.
1202 Always print file name headers.
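
The effect of a leading @samp{+} on the count is easiest to see side by side with the ordinary form. The sample input is invented for this sketch, and a shell with @code{printf} is assumed.

```shell
# Last two lines.
printf '1\n2\n3\n4\n' | tail -n 2
# 3
# 4

# A leading + counts from the start: begin printing at line 3.
printf '1\n2\n3\n4\n' | tail -n +3
# 3
# 4

# Last three bytes instead of lines.
printf 'abcdef' | tail -c 3
# def
```

On this four-line input the first two commands happen to print the same lines; on a longer input they would differ, since one counts from the end and the other from the start.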
1207 @node split invocation
1208 @section @code{split}: Split a file into fixed-size pieces
1211 @cindex splitting a file into pieces
1212 @cindex pieces, splitting a file into
1214 @code{split} creates output files containing consecutive sections of
1215 @var{input} (standard input if none is given or @var{input} is
1216 @samp{-}). Synopsis:
1219 split [@var{option}] [@var{input} [@var{prefix}]]
1222 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1223 left over for the last section) into each output file.
1225 @cindex output file name prefix
1226 The output files' names consist of @var{prefix} (@samp{x} by default)
1227 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1228 that concatenating the output files in sorted order by file name produces
1229 the original input file. (If more than 676 output files are required,
1230 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
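The naming scheme can be tried out with a small sketch such as the one
below. The scratch directory, the file names and the @samp{part.}
prefix are assumptions chosen for the example; @samp{-l} is the
lines-per-file option described under the options that follow.

```shell
# Scratch directory and file names are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 > "$workdir/input.txt"

# Two lines per piece, with "part." as the output prefix.
split -l 2 "$workdir/input.txt" "$workdir/part."

# The shell expands the glob in sorted order: part.aa part.ab part.ac
pieces=$(cd "$workdir" && echo part.*)

# Concatenating the pieces in that order reproduces the input.
cat "$workdir"/part.* > "$workdir/rejoined.txt"
if cmp -s "$workdir/input.txt" "$workdir/rejoined.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "$pieces: $roundtrip"
rm -rf "$workdir"
```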
1232 The program accepts the following options. Also see @ref{Common options}.
1237 @itemx -l @var{lines}
1238 @itemx --lines=@var{lines}
1241 Put @var{lines} lines of @var{input} into each output file.
1243 @item -b @var{bytes}
1244 @itemx --bytes=@var{bytes}
1247 Put the first @var{bytes} bytes of @var{input} into each output file.
1248 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1249 @samp{m} by 1048576.
1251 @item -C @var{bytes}
1252 @itemx --line-bytes=@var{bytes}
1254 @opindex --line-bytes
1255 Put into each output file as many complete lines of @var{input} as
1256 possible without exceeding @var{bytes} bytes. For lines longer than
1257 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1258 fewer than @var{bytes} bytes of the line are left, then continue
1259 normally. @var{bytes} has the same format as for the @samp{--bytes}
1265 @node csplit invocation
1266 @section @code{csplit}: Split a file into context-determined pieces
1269 @cindex context splitting
1270 @cindex splitting a file into pieces by context
1272 @code{csplit} creates zero or more output files containing sections of
1273 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1276 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1279 The contents of the output files are determined by the @var{pattern}
1280 arguments, as detailed below. An error occurs if a @var{pattern}
1281 argument refers to a nonexistent line of the input file (e.g., if no
1282 remaining line matches a given regular expression). After every
1283 @var{pattern} has been matched, any remaining input is copied into one
1286 By default, @code{csplit} prints the number of bytes written to each
1287 output file after it has been created.
1289 The types of pattern arguments are:
1294 Create an output file containing the input up to but not including line
1295 @var{n} (a positive integer). If followed by a repeat count, also
1296 create an output file containing the next @var{n} lines of the input
1297 file once for each repeat.
1299 @item /@var{regexp}/[@var{offset}]
1300 Create an output file containing the current line up to (but not
1301 including) the next line of the input file that contains a match for
1302 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1303 followed by a positive integer. If it is given, the input up to the
1304 matching line plus or minus @var{offset} is put into the output file,
1305 and the line after that begins the next section of input.
1307 @item %@var{regexp}%[@var{offset}]
1308 Like the previous type, except that it does not create an output
1309 file, so that section of the input file is effectively ignored.
1311 @item @{@var{repeat-count}@}
1312 Repeat the previous pattern @var{repeat-count} additional
1313 times. @var{repeat-count} can either be a positive integer or an
1314 asterisk, meaning repeat as many times as necessary until the input is
1319 The output files' names consist of a prefix (@samp{xx} by default)
1320 followed by a suffix. By default, the suffix is an ascending sequence
1321 of two-digit decimal numbers from @samp{00} up to @samp{99}. In any
1322 case, concatenating the output files in sorted order by filename
1323 produces the original input file.
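A minimal sketch of a regular-expression split with a repeat count
might look like this. The input text, the scratch directory and the
@samp{sec.} prefix are invented for the example; @samp{-s} suppresses
the byte counts and @samp{-f} sets the prefix.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'intro' '# A' 'a1' '# B' 'b1' > "$workdir/notes.txt"

# Split before every line that starts with "#".
# The quotes keep the shell from interpreting the pattern arguments.
csplit -s -f "$workdir/sec." "$workdir/notes.txt" '/^#/' '{*}'

# Default two-digit suffixes: sec.00 sec.01 sec.02
pieces=$(cd "$workdir" && echo sec.*)
second=$(cat "$workdir/sec.01")
echo "$pieces"
rm -rf "$workdir"
```

Here the first piece holds the text before the first matching line, and
each later piece begins with a matching line.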
1325 By default, if @code{csplit} encounters an error or receives a hangup,
1326 interrupt, quit, or terminate signal, it removes any output files
1327 that it has created so far before it exits.
1329 The program accepts the following options. Also see @ref{Common options}.
1333 @item -f @var{prefix}
1334 @itemx --prefix=@var{prefix}
1337 @cindex output file name prefix
1338 Use @var{prefix} as the output file name prefix.
1340 @item -b @var{suffix}
1341 @itemx --suffix=@var{suffix}
1344 @cindex output file name suffix
1345 Use @var{suffix} as the output file name suffix. When this option is
1346 specified, the suffix string must include exactly one
1347 @code{printf(3)}-style conversion specification, possibly including
1348 format specification flags, a field width, a precision specification,
1349 or all of these kinds of modifiers. The format letter must convert a
1350 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1351 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1352 entire @var{suffix} is given (with the current output file number) to
1353 @code{sprintf(3)} to form the file name suffixes for each of the
1354 individual output files in turn. If this option is used, the
1355 @samp{--digits} option is ignored.
1357 @item -n @var{digits}
1358 @itemx --digits=@var{digits}
1361 Use output file names containing numbers that are @var{digits} digits
1362 long instead of the default 2.
1367 @opindex --keep-files
1368 Do not remove output files when errors are encountered.
1371 @itemx --elide-empty-files
1373 @opindex --elide-empty-files
1374 Suppress the generation of zero-length output files. (In cases where
1375 the section delimiters of the input file are supposed to mark the first
1376 lines of each of the sections, the first output file will generally be a
1377 zero-length file unless you use this option.) The output file sequence
1378 numbers always run consecutively starting from 0, even when this option
1389 Do not print counts of output file sizes.
1394 @node Summarizing files
1395 @chapter Summarizing files
1397 @cindex summarizing files
1399 These commands generate just a few numbers representing entire
1403 * wc invocation:: Print byte, word, and line counts.
1404 * sum invocation:: Print checksum and block counts.
1405 * cksum invocation:: Print CRC checksum and byte counts.
1406 * md5sum invocation:: Print or check message-digests.
1411 @section @code{wc}: Print byte, word, and line counts
1418 @code{wc} counts the number of bytes, whitespace-separated words, and
1419 newlines in each given @var{file}, or standard input if none are given
1420 or for a @var{file} of @samp{-}. Synopsis:
1423 wc [@var{option}]@dots{} [@var{file}]@dots{}
1426 @cindex total counts
1427 @code{wc} prints one line of counts for each file, and if the file was
1428 given as an argument, it prints the file name following the counts. If
1429 more than one @var{file} is given, @code{wc} prints a final line
1430 containing the cumulative counts, with the file name @file{total}. The
1431 counts are printed in this order: newlines, words, bytes.
1433 By default, @code{wc} prints all three counts. Options can specify
1434 that only certain counts be printed. Options do not undo others
1435 previously given, so
1442 prints both the byte counts and the word counts.
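A small sketch of combining two count options follows; the sample text
is made up. Note that the counts come out in the fixed order given
above (words before bytes), whatever the order of the options.

```shell
# Made-up sample text: 3 words, 14 bytes including the two newlines.
counts=$(printf 'one two\nthree\n' | wc -c -w)

# wc may pad the numbers with blanks, so let the shell split them.
set -- $counts
words=$1
bytes=$2
echo "words=$words bytes=$bytes"
```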
1444 The program accepts the following options. Also see @ref{Common options}.
1454 Print only the byte counts.
1460 Print only the word counts.
1466 Print only the newline counts.
1471 @node sum invocation
1472 @section @code{sum}: Print checksum and block counts
1475 @cindex 16-bit checksum
1476 @cindex checksum, 16-bit
1478 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1479 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1482 sum [@var{option}]@dots{} [@var{file}]@dots{}
1485 @code{sum} prints the checksum for each @var{file} followed by the
1486 number of blocks in the file (rounded up). If more than one @var{file}
1487 is given, file names are also printed (by default). (With the
1488 @samp{--sysv} option, the corresponding file name is printed when there is
1489 at least one file argument.)
1491 By default, GNU @code{sum} computes checksums using an algorithm
1492 compatible with BSD @code{sum} and prints file sizes in units of
1495 The program accepts the following options. Also see @ref{Common options}.
1501 @cindex BSD @code{sum}
1502 Use the default (BSD compatible) algorithm. This option is included for
1503 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1504 given, it has no effect.
1510 @cindex System V @code{sum}
1511 Compute checksums using an algorithm compatible with System V
1512 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1516 @code{sum} is provided for compatibility; the @code{cksum} program (see
1517 next section) is preferable in new applications.
1520 @node cksum invocation
1521 @section @code{cksum}: Print CRC checksum and byte counts
1524 @cindex cyclic redundancy check
1525 @cindex CRC checksum
1527 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1528 given @var{file}, or standard input if none are given or for a
1529 @var{file} of @samp{-}. Synopsis:
1532 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1535 @code{cksum} prints the CRC checksum for each file along with the number
1536 of bytes in the file, and the filename unless no arguments were given.
1538 @code{cksum} is typically used to ensure that files
1539 transferred by unreliable means (e.g., netnews) have not been corrupted,
1540 by comparing the @code{cksum} output for the received files with the
1541 @code{cksum} output for the original files (typically given in the
1544 The CRC algorithm is specified by the POSIX.2 standard. It is not
1545 compatible with the BSD or System V @code{sum} algorithms (see the
1546 previous section); it is more robust.
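The comparison described above can be sketched as follows. The file
names are hypothetical, and a plain @code{cp} stands in for whatever
transfer mechanism is actually used.

```shell
workdir=$(mktemp -d)
printf 'payload\n' > "$workdir/original"
cp "$workdir/original" "$workdir/received"   # stands in for the transfer

# Reading from standard input keeps file names out of the output,
# so the two results can be compared as plain strings.
sum_original=$(cksum < "$workdir/original")
sum_received=$(cksum < "$workdir/received")

if [ "$sum_original" = "$sum_received" ]; then
    status=intact
else
    status=corrupted
fi
echo "$status"
rm -rf "$workdir"
```

Each result consists of the CRC followed by the byte count, so a change
in either the contents or the length of the file shows up as a
mismatch.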
1548 The only options are @samp{--help} and @samp{--version}. @xref{Common
1552 @node md5sum invocation
1553 @section @code{md5sum}: Print or check message-digests
1556 @cindex 128-bit checksum
1557 @cindex checksum, 128-bit
1558 @cindex fingerprint, 128-bit
1559 @cindex message-digest, 128-bit
1561 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1562 @dfn{message-digest}) for each given @var{file}, or standard input if
1563 none are given or for a @var{file} of @samp{-}. It can also check if the
1564 checksum has changed. Synopsis:
1567 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1570 For each @var{file}, @samp{md5sum} outputs the MD5 checksum, a flag
1571 indicating a binary or text input file, and the filename.
1573 The program accepts the following options. Also see @ref{Common options}.
1581 @cindex binary input files
1582 Treat input files as binary. This makes no difference on Unix systems,
1583 but other systems have different internal and external character
1584 representations, notably to mark end-of-line.
1587 @itemx --check=@var{file}
1588 @var{file} is taken as the output of a former run of @samp{md5sum}: each
1589 line consists of an MD5 checksum, a binary/text flag, and a filename.
1590 The file will be opened (with each possible relative path) and its
1591 message-digest computed. If this computed message digest is not the
1592 same as that given in the line, the file will be marked as failed.
1595 @itemx --string=@var{string}
1598 Compute the message digest for @var{string}, instead of for a file. The
1599 result is the same as for a file that contains exactly @var{string}.
1605 @cindex text input files
1606 Treat all input files as text files. This is the reverse of
1612 Print progress information.
1617 @node Operating on sorted files
1618 @chapter Operating on sorted files
1620 @cindex operating on sorted files
1621 @cindex sorted files, operations on
1623 These commands work with (or produce) sorted files.
1626 * sort invocation:: Sort text files.
1627 * uniq invocation:: Uniqify files.
1628 * comm invocation:: Compare two sorted files line by line.
1632 @node sort invocation
1633 @section @code{sort}: Sort text files
1636 @cindex sorting files
1638 @code{sort} sorts, merges, or compares all the lines from the given
1639 files, or standard input if none are given or for a @var{file} of
1640 @samp{-}. By default, @code{sort} writes the results to standard
1644 sort [@var{option}]@dots{} [@var{file}]@dots{}
1647 @code{sort} has three modes of operation: sort (the default), merge,
1648 and check for sortedness. The following options change the operation
1655 @cindex checking for sortedness
1656 Check whether the given files are already sorted: if they are not all
1657 sorted, print an error message and exit with a status of 1.
1661 @cindex merging sorted files
1662 Merge the given files by sorting them as a group. Each input file must
1663 already be individually sorted. It always works to sort instead of
1664 merge; merging is provided because it is faster, in the case where it
1669 A pair of lines is compared as follows: if any key fields have been
1670 specified, @code{sort} compares each pair of fields, in the order
1671 specified on the command line, according to the associated ordering
1672 options, until a difference is found or no fields are left.
1674 If any of the global options @samp{Mbdfinr} are given but no key fields
1675 are specified, @code{sort} compares the entire lines according to the
1678 Finally, as a last resort when all keys compare equal (or if no
1679 ordering options were specified at all), @code{sort} compares the lines
1680 byte by byte in machine collating sequence. The last resort comparison
1681 honors the @samp{-r} global option. The @samp{-s} (stable) option
1682 disables this last-resort comparison so that lines in which all fields
1683 compare equal are left in their original relative order. If no fields
1684 or global options are specified, @samp{-s} has no effect.
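The effect of the last-resort comparison, and of disabling it with
@samp{-s}, can be seen in a sketch like the following. The three input
lines are made up, and @code{LC_ALL=C} is set only so that the result
does not depend on the locale.

```shell
LC_ALL=C
export LC_ALL   # byte-wise collation, independent of the user's locale

# Two lines share the key "a" in field 1 (made-up sample data).
by_key=$(printf '%s\n' 'a 2' 'a 1' 'b 0' | sort -k 1,1)
by_key_stable=$(printf '%s\n' 'a 2' 'a 1' 'b 0' | sort -s -k 1,1)

printf '%s\n' "$by_key" '--' "$by_key_stable"
```

Without @samp{-s}, the two @samp{a} lines are ordered by comparing the
whole lines, so @samp{a 1} comes first; with @samp{-s} they keep their
input order.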
1686 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1687 input line length or restrictions on bytes allowed within lines. In
1688 addition, if the final byte of an input file is not a newline, GNU
1689 @code{sort} silently supplies one.
1692 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1693 value as the directory for temporary files instead of @file{/tmp}. The
1694 @samp{-T @var{tempdir}} option in turn overrides the environment
1697 The following options affect the ordering of output lines. They may be
1698 specified globally or as part of a specific key field. If no key
1699 fields are specified, global options apply to comparison of entire
1700 lines; otherwise the global options are inherited by key fields that do
1701 not specify any special options of their own.
1707 @cindex blanks, ignoring leading
1708 Ignore leading blanks when finding sort keys in each line.
1712 @cindex phone directory order
1713 @cindex telephone directory order
1714 Sort in @dfn{phone directory} order: ignore all characters except
1715 letters, digits and blanks when sorting.
1719 @cindex case folding
1720 Fold lowercase characters into the equivalent uppercase characters when
1721 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1725 @cindex unprintable characters, ignoring
1726 Ignore characters outside the printable ASCII range 040-0176 octal
1727 (inclusive) when sorting.
1731 @cindex months, sorting by
1732 An initial string, consisting of any amount of whitespace, followed
1733 by three letters abbreviating a month name, is folded to UPPER case and
1734 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1735 Invalid names compare low to valid names.
1739 @cindex numeric sort
1740 Sort numerically: the number begins each line; specifically, it consists
1741 of optional whitespace, an optional @samp{-} sign, and zero or more
1742 digits, optionally followed by a decimal point and zero or more digits.
1746 @cindex reverse sorting
1747 Reverse the result of comparison, so that lines with greater key values
1748 appear earlier in the output instead of later.
1756 @item -o @var{output-file}
1758 @cindex overwriting of input, allowed
1759 Write output to @var{output-file} instead of standard output.
1760 If @var{output-file} is one of the input files, @code{sort} copies
1761 it to a temporary file before sorting and writing the output to
1764 @item -t @var{separator}
1766 @cindex field separator character
1767 Use character @var{separator} as the field separator when finding the
1768 sort keys in each line. By default, fields are separated by the empty
1769 string between a non-whitespace character and a whitespace character.
1770 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
1771 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
1772 not considered to be part of either the field preceding or the field
1777 @cindex uniqifying output
1778 For the default case or the @samp{-m} option, only output the first
1779 of a sequence of lines that compare equal. For the @samp{-c} option,
1780 check that no pair of consecutive lines compares equal.
1782 @item -k @var{pos1}[,@var{pos2}]
1785 The recommended, POSIX, option for specifying a sort field. The field
1786 consists of the line between @var{pos1} and @var{pos2} (or the end of
1787 the line, if @var{pos2} is omitted), inclusive. Fields and character
1788 positions are numbered starting with 1. See below.
1790 @item +@var{pos1}[-@var{pos2}]
1791 The obsolete, traditional option for specifying a sort field. The field
1792 consists of the part of the line from @var{pos1} up to but not including
1793 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
1794 and character positions are numbered starting with 0. See below.
1798 In addition, when GNU @code{sort} is invoked with exactly one argument,
1799 options @samp{--help} and @samp{--version} are recognized. @xref{Common
1802 Historical (BSD and System V) implementations of @code{sort} have
1803 differed in their interpretation of some options, particularly
1804 @samp{-b}, @samp{-f}, and @samp{-n}. GNU @code{sort} follows the POSIX
1805 behavior, which is usually (but not always!) like the System V behavior.
1806 According to POSIX, @samp{-n} no longer implies @samp{-b}. For
1807 consistency, @samp{-M} has been changed in the same way. This may
1808 affect the meaning of character positions in field specifications in
1809 obscure cases. The only fix is to add an explicit @samp{-b}.
1811 A position in a sort field specified with the @samp{-k} or @samp{+}
1812 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
1813 of the field to use and @var{c} is the number of the first character
1814 from the beginning of the field (for @samp{+@var{pos}}) or from the end
1815 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
1816 is omitted, it's taken to be the first character in the field. If the
1817 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
1818 specification is counted from the first nonblank character of the field
1819 (for @samp{+@var{pos}}) or from the first nonblank character following
1820 the previous field (for @samp{-@var{pos}}).
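As an illustration of a key specification, the sketch below sorts some
invented @samp{name:count} records on their second field, first as text
and then numerically. The @samp{n} appended to the key is one of the
ordering option letters; the data and the choice of @samp{:} as the
separator are assumptions made for the example.

```shell
LC_ALL=C
export LC_ALL   # byte-wise collation, independent of the user's locale

# name:count records (hypothetical data); field 2 is the sort key.
as_text=$(printf '%s\n' b:10 a:9 c:100 | sort -t : -k 2,2)
as_number=$(printf '%s\n' b:10 a:9 c:100 | sort -t : -k 2,2n)

printf '%s\n' "$as_text" '--' "$as_number"
```

Compared as text, @samp{10} and @samp{100} sort before @samp{9};
compared as numbers, the records come out in the order 9, 10, 100.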
1822 A sort key option may also have any of the option letters @samp{Mbdfinr}
1823 appended to it, in which case the global ordering options are not used
1824 for that particular field. The @samp{-b} option may be independently
1825 attached to either or both of the @samp{+@var{pos}} and
1826 @samp{-@var{pos}} parts of a field specification, and if it is inherited
1827 from the global options it will be attached to both. If a @samp{-n} or
1828 @samp{-M} option is used, thus implying a @samp{-b} option, the
1829 @samp{-b} option is taken to apply to both the @samp{+@var{pos}} and the
1830 @samp{-@var{pos}} parts of a key specification. Keys may span multiple
1834 @node uniq invocation
1835 @section @code{uniq}: Uniqify files
1838 @cindex uniqify files
1840 @code{uniq} writes the unique lines in the given @file{input}, or
1841 standard input if nothing is given or for an @var{input} name of
1845 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
1848 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
1849 discards all but one of identical successive lines. Optionally, it can
1850 instead show only lines that appear exactly once, or lines that appear
1853 The input must be sorted. If your input is not sorted, perhaps you want
1854 to use @code{sort -u}.
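The following sketch, using three made-up input lines, shows why the
input needs to be sorted: only successive duplicates are discarded.

```shell
LC_ALL=C
export LC_ALL   # byte-wise collation, independent of the user's locale

# The two "b" lines are not adjacent, so uniq alone keeps both.
unsorted_result=$(printf '%s\n' b a b | uniq)

# Sorting first brings the duplicates together.
sorted_result=$(printf '%s\n' b a b | sort | uniq)

# sort -u gives the same lines in a single step.
sort_u_result=$(printf '%s\n' b a b | sort -u)

printf '%s\n' "$unsorted_result" '--' "$sorted_result"
```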
1856 If no @var{output} file is specified, @code{uniq} writes to standard
1859 The program accepts the following options. Also see @ref{Common options}.
1865 @itemx --skip-fields=@var{n}
1868 @opindex --skip-fields
1869 Skip @var{n} fields on each line before checking for uniqueness. Fields
1870 are sequences of non-space non-tab characters that are separated from
1871 each other by at least one space or tab.
1875 @itemx --skip-chars=@var{n}
1878 @opindex --skip-chars
1879 Skip @var{n} characters before checking for uniqueness. If you use both
1880 the field and character skipping options, fields are skipped over first.
1886 Print the number of times each line occurred along with the line.
1892 @cindex duplicate lines, outputting
1893 Print only duplicate lines.
1899 @cindex unique lines, outputting
1900 Print only unique lines.
1903 @itemx --check-chars=@var{n}
1905 @opindex --check-chars
1906 Compare @var{n} characters on each line (after skipping any specified
1907 fields and characters). By default the entire rest of the lines are
1913 @node comm invocation
1914 @section @code{comm}: Compare two sorted files line by line
1917 @cindex line-by-line comparison
1918 @cindex comparing sorted files
1920 @code{comm} writes to standard output lines that are common, and lines
1921 that are unique, to two input files; a file name of @samp{-} means
1922 standard input. Synopsis:
1925 comm [@var{option}]@dots{} @var{file1} @var{file2}
1928 The input files must be sorted before @code{comm} can be used.
1930 @cindex differing lines
1931 @cindex common lines
1932 With no options, @code{comm} produces three-column output. Column one
1933 contains lines unique to @var{file1}, column two contains lines unique
1934 to @var{file2}, and column three contains lines common to both files.
1939 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
1940 the corresponding columns. Also see @ref{Common options}.
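Suppressing two of the three columns leaves a plain list, as in this
sketch. The two sorted input files and their names are invented for the
example.

```shell
LC_ALL=C
export LC_ALL   # byte-wise collation, independent of the user's locale

workdir=$(mktemp -d)
printf '%s\n' a b c > "$workdir/file1"
printf '%s\n' b c d > "$workdir/file2"

# Keep column one only: lines unique to file1.
only_in_first=$(comm -23 "$workdir/file1" "$workdir/file2")
# Keep column two only: lines unique to file2.
only_in_second=$(comm -13 "$workdir/file1" "$workdir/file2")
# Keep column three only: lines common to both files.
in_both=$(comm -12 "$workdir/file1" "$workdir/file2")

printf '%s\n' "$only_in_first" "$only_in_second" "$in_both"
rm -rf "$workdir"
```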
1943 @node Operating on fields within a line
1944 @chapter Operating on fields within a line
1947 * cut invocation:: Print selected parts of lines.
1948 * paste invocation:: Merge lines of files.
1949 * join invocation:: Join lines on a common field.
1953 @node cut invocation
1954 @section @code{cut}: Print selected parts of lines
1957 @code{cut} writes to standard output selected parts of each line of each
1958 input file, or standard input if no files are given or for a file name of
1962 cut [@var{option}]@dots{} [@var{file}]@dots{}
1965 In the table which follows, the @var{byte-list}, @var{character-list},
1966 and @var{field-list} are one or more numbers or ranges (two numbers
1967 separated by a dash) separated by commas. Bytes, characters, and
1968 fields are numbered starting at 1. Incomplete ranges may be
1969 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
1970 @samp{@var{n}} through end of line or last field.
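A short sketch of list syntax, including both kinds of incomplete
range, is given below. The colon-separated record is made up, and the
@samp{-d}, @samp{-f} and @samp{-c} options used here are the ones
described in the table that follows.

```shell
record='root:x:0:0'   # made-up colon-separated record

# Field 1, then field 3 through the last field.
first_and_rest=$(printf '%s\n' "$record" | cut -d : -f 1,3-)

# Characters 1 through 4.
leading=$(printf '%s\n' "$record" | cut -c -4)

printf '%s\n' "$first_and_rest" "$leading"
```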
1972 The program accepts the following options. Also see @ref{Common
1977 @item -b @var{byte-list}
1978 @itemx --bytes=@var{byte-list}
1981 Print only the bytes in positions listed in @var{byte-list}. Tabs and
1982 backspaces are treated like any other character; they take up 1 byte.
1984 @item -c @var{character-list}
1985 @itemx --characters=@var{character-list}
1987 @opindex --characters
1988 Print only characters in positions listed in @var{character-list}.
1989 The same as @samp{-b} for now, but internationalization will change
1990 that. Tabs and backspaces are treated like any other character; they
1991 take up 1 character.
1993 @item -f @var{field-list}
1994 @itemx --fields=@var{field-list}
1997 Print only the fields listed in @var{field-list}. Fields are
1998 separated by a @key{TAB} by default.
2000 @item -d @var{delim}
2001 @itemx --delimiter=@var{delim}
2003 @opindex --delimiter
2004 For @samp{-f}, fields are separated by the first character in @var{delim}
2005 (default is @key{TAB}).
2009 Do not split multibyte characters (no-op for now).
2012 @itemx --only-delimited
2014 @opindex --only-delimited
2015 For @samp{-f}, do not print lines that do not contain the field separator
2021 @node paste invocation
2022 @section @code{paste}: Merge lines of files
2025 @cindex merging files
2027 @code{paste} writes to standard output lines consisting of sequentially
2028 corresponding lines of each given file, separated by @key{TAB}.
2029 Standard input is used for a file name of @samp{-} or if no input files
2035 paste [@var{option}]@dots{} [@var{file}]@dots{}
2038 The program accepts the following options. Also see @ref{Common options}.
2046 Paste the lines of one file at a time rather than one line from each
2049 @item -d @var{delim-list}
2050 @itemx --delimiters=@var{delim-list}
2052 @opindex --delimiters
2053 Consecutively use the characters in @var{delim-list} instead of
2054 @key{TAB} to separate merged lines. When @var{delim-list} is
2055 exhausted, start again at its beginning.
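The sketch below contrasts the default @key{TAB} separator with a
two-character delimiter list. The three input files and their contents
are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\n' a b > "$workdir/letters"
printf '%s\n' 1 2 > "$workdir/numbers"
printf '%s\n' x y > "$workdir/marks"

# Default: corresponding lines joined by a TAB.
side_by_side=$(paste "$workdir/letters" "$workdir/numbers")

# The delimiter list is used in turn: "," after the first file,
# ";" after the second.
cycled=$(paste -d ',;' "$workdir/letters" "$workdir/numbers" "$workdir/marks")

printf '%s\n' "$side_by_side" "$cycled"
rm -rf "$workdir"
```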
2060 @node join invocation
2061 @section @code{join}: Join lines on a common field
2064 @cindex common field, joining on
2066 @code{join} writes to standard output a line for each pair of input
2067 lines that have identical join fields. Synopsis:
2070 join [@var{option}]@dots{} @var{file1} @var{file2}
2073 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2074 meaning standard input. @var{file1} and @var{file2} should be already
2075 sorted in increasing order (not numerically) on the join fields; unless
2076 the @samp{-t} option is given, they should be sorted ignoring blanks at
2077 the start of the line, as in @code{sort -b}.
2079 The defaults are: the join field is the first field in each line;
2080 fields in the input are separated by one or more blanks, with leading
2081 blanks on the line ignored; fields in the output are separated by a
2082 space; each output line consists of the join field, the remaining
2083 fields from @var{file1}, then the remaining fields from @var{file2}.
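These defaults can be illustrated with two small, already sorted files.
The file names and contents are made up; the @samp{-a} option used in
the second command is described in the option list.

```shell
workdir=$(mktemp -d)
printf '%s\n' '1 apple' '2 banana' > "$workdir/fruit"
printf '%s\n' '1 red' '3 blue' > "$workdir/colour"

# Only the key "1" appears in both files.
paired=$(join "$workdir/fruit" "$workdir/colour")

# Also print the lines of the first file that have no partner.
with_unpaired=$(join -a 1 "$workdir/fruit" "$workdir/colour")

printf '%s\n' "$paired" '--' "$with_unpaired"
rm -rf "$workdir"
```

The paired line consists of the join field, then the rest of the line
from the first file, then the rest of the line from the second.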
2085 The program accepts the following options. Also see @ref{Common options}.
2089 @item -a @var{file-number}
2091 Print a line for each unpairable line in file @var{file-number} (either
2092 @samp{1} or @samp{2}), in addition to the normal output.
2094 @item -e @var{string}
2096 Replace those output fields that are missing in the input with
2099 @item -1 @var{field}
2100 @itemx -j1 @var{field}
2103 Join on field @var{field} (a positive integer) of file 1.
2105 @item -2 @var{field}
2106 @itemx -j2 @var{field}
2109 Join on field @var{field} (a positive integer) of file 2.
2111 @item -j @var{field}
2112 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2114 @item -o @var{field-list}@dots{}
2115 Construct each output line according to the format in @var{field-list}.
2116 Each element in @var{field-list} consists of a file number (either 1 or
2117 2), a period, and a field number (a positive integer). The elements in
2118 the list are separated by commas or blanks. Multiple @var{field-list}
2119 arguments can be given after a single @samp{-o} option; the values
2120 of all lists given with @samp{-o} are concatenated together.
2123 Use character @var{char} as the input and output field separator.
2125 @item -v @var{file-number}
2126 Print a line for each unpairable line in file @var{file-number}
2127 (either 1 or 2), instead of the normal output.
2131 In addition, when GNU @code{join} is invoked with exactly one argument,
2132 options @samp{--help} and @samp{--version} are recognized. @xref{Common
2136 @node Operating on characters
2137 @chapter Operating on characters
2139 @cindex operating on characters
2141 These commands operate on individual characters.
2144 * tr invocation:: Translate, squeeze, and/or delete characters.
2145 * expand invocation:: Convert tabs to spaces.
2146 * unexpand invocation:: Convert spaces to tabs.
2151 @section @code{tr}: Translate, squeeze, and/or delete characters
2158 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
2161 @code{tr} copies standard input to standard output, performing
2162 one of the following operations:
2166 translate, and optionally squeeze repeated characters in the result,
2168 squeeze repeated characters,
2172 delete characters, then squeeze repeated characters from the result.
2175 The @var{set1} and (if given) @var{set2} arguments define ordered
2176 sets of characters, referred to below as @var{set1} and @var{set2}. These
2177 sets are the characters of the input that @code{tr} operates on.
2178 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
2179 complement (all of the characters that are not in @var{set1}).
2182 * Character sets:: Specifying sets of characters.
2183 * Translating:: Changing one character to another.
2184 * Squeezing:: Squeezing repeats and deleting.
2185 * Warnings in tr:: Warning messages.
2189 @node Character sets
2190 @subsection Specifying sets of characters
2192 @cindex specifying sets of characters
2194 The format of the @var{set1} and @var{set2} arguments resembles
2195 the format of regular expressions; however, they are not regular
2196 expressions, only lists of characters. Most characters simply
2197 represent themselves in these strings, but the strings can contain
2198 the shorthands listed below, for convenience. Some of them can be
2199 used only in @var{set1} or @var{set2}, as noted below.
2203 @item Backslash escapes.
2204 @cindex backslash escapes
2206 A backslash followed by a character not listed below causes an error
2225 The character with the value given by @var{ooo}, which is 1 to 3
2234 The notation @samp{@var{m}-@var{n}} expands to all of the characters
2235 from @var{m} through @var{n}, in ascending order. @var{m} should
2236 collate before @var{n}; if it doesn't, an error results. As an example,
2237 @samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
2238 does not support the System V syntax that uses square brackets to
2239 enclose ranges, translations specified in that format will still work as
2240 long as the brackets in @var{set1} correspond to identical brackets
2243 @item Repeated characters.
2244 @cindex repeated characters
2246 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
2247 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
2248 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
2249 to as many copies of @var{c} as are needed to make @var{set2} as long as
2250 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
2251 octal, otherwise in decimal.
2253 @item Character classes.
2254 @cindex character classes
2256 The notation @samp{[:@var{class}:]} expands to all of the characters in
2257 the (predefined) class @var{class}. The characters expand in no
2258 particular order, except for the @code{upper} and @code{lower} classes,
2259 which expand in ascending order. When the @samp{--delete} (@samp{-d})
2260 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
2261 character class can be used in @var{set2}. Otherwise, only the
2262 character classes @code{lower} and @code{upper} are accepted in
2263 @var{set2}, and then only if the corresponding character class
2264 (@code{upper} and @code{lower}, respectively) is specified in the same
2265 relative position in @var{set1}. Doing this specifies case conversion.
2266 The class names are given below; an error results when an invalid class
2278 Horizontal whitespace.
2287 Printable characters, not including space.
2293 Printable characters, including space.
2296 Punctuation characters.
2299 Horizontal or vertical whitespace.
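The following sketch shows a few class-based operations. The sample
strings are made up for illustration, and @code{LC_ALL=C} is set here
only so that the class contents are the same on any system.

@example
# Case conversion: lower and upper in the same relative position.
printf 'Hello, World\n' | LC_ALL=C tr '[:lower:]' '[:upper:]'
# prints HELLO, WORLD

# Delete every digit.
printf 'room 101, floor 3\n' | LC_ALL=C tr -d '[:digit:]'
# prints "room , floor "

# With both -d and -s, any class may appear in set2:
# delete punctuation, then squeeze runs of whitespace.
printf 'wait...   what?!\n' | LC_ALL=C tr -ds '[:punct:]' '[:space:]'
# prints "wait what"
@end example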
2308 @item Equivalence classes.
2309 @cindex equivalence classes
2311 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
2312 equivalent to @var{c}, in no particular order. Equivalence classes are
2313 a relatively recent invention intended to support non-English alphabets.
2314 But there seems to be no standard way to define them or determine their
2315 contents. Therefore, they are not fully implemented in GNU @code{tr};
2316 each character's equivalence class consists only of that character,
2317 which is of no particular use.
2323 @subsection Translating
2325 @cindex translating characters
2327 @code{tr} performs translation when @var{set1} and @var{set2} are
2328 both given and the @samp{--delete} (@samp{-d}) option is not given.
2329 @code{tr} translates each character of its input that is in @var{set1}
2330 to the corresponding character in @var{set2}. Characters not in
2331 @var{set1} are passed through unchanged. When a character appears more
2332 than once in @var{set1} and the corresponding characters in @var{set2}
2333 are not all the same, only the final one is used. For example, these
2334 two commands are equivalent:
2341 A common use of @code{tr} is to convert lowercase characters to
2342 uppercase. This can be done in many ways. Here are three of them:
2345 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
2347 tr '[:lower:]' '[:upper:]'
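As a brief sketch of the rules above, with sample input invented for
illustration: when a character is repeated in @var{set1}, the last
mapping wins, and the different spellings of case conversion produce
the same output.

@example
# a appears three times in set1; only the final mapping (a to z) is used.
printf 'banana\n' | tr aaa xyz
# prints bznznz

# Two spellings of lowercase-to-uppercase conversion.
printf 'tr is handy\n' | tr a-z A-Z
printf 'tr is handy\n' | LC_ALL=C tr '[:lower:]' '[:upper:]'
# both print TR IS HANDY
@end example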
2350 When @code{tr} is performing translation, @var{set1} and @var{set2}
2351 typically have the same length. If @var{set1} is shorter than
2352 @var{set2}, the extra characters at the end of @var{set2} are ignored.
2354 On the other hand, making @var{set1} longer than @var{set2} is not
2355 portable; POSIX.2 says that the result is undefined. In this situation,
2356 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
2357 the last character of @var{set2} as many times as necessary. System V
2358 @code{tr} truncates @var{set1} to the length of @var{set2}.
2360 By default, GNU @code{tr} handles this case like BSD @code{tr}. When
2361 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
2362 handles this case like the System V @code{tr} instead. This option is
2363 ignored for operations other than translation.
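A minimal sketch of the difference, using a made-up input string. The
outputs shown assume GNU @code{tr}; as noted above, other
implementations may behave differently when @var{set1} is longer than
@var{set2}.

@example
# Default (BSD-like): set2 is padded with its last character,
# so c, d, and e all map to y.
printf 'abcde\n' | tr abcde xy
# prints xyyyy

# With -t (System V-like): set1 is truncated to ab,
# so c, d, and e pass through unchanged.
printf 'abcde\n' | tr -t abcde xy
# prints xycde
@end example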
2365 Acting like System V @code{tr} in this case breaks the relatively common
2369 tr -cs A-Za-z0-9 '\012'
2373 because it converts only zero bytes (the first element in the
2374 complement of @var{set1}), rather than all non-alphanumerics, to
2379 @subsection Squeezing repeats and deleting
2381 @cindex squeezing repeat characters
2382 @cindex deleting characters
2384 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
2385 removes any input characters that are in @var{set1}.
2387 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
2388 @code{tr} replaces each input sequence of a repeated character that
2389 is in @var{set1} with a single occurrence of that character.
2391 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
2392 first performs any deletions using @var{set1}, then squeezes repeats
2393 from any remaining characters using @var{set2}.
2395 The @samp{--squeeze-repeats} option may also be used when translating,
2396 in which case @code{tr} first performs translation, then squeezes
2397 repeats from any remaining characters using @var{set2}.
2399 Here are some examples to illustrate various combinations of options:
2404 Remove all zero bytes:
2411 Put all words on lines by themselves. This converts all
2412 non-alphanumeric characters to newlines, then squeezes each string
2413 of repeated newlines into a single newline:
2416 tr -cs 'a-zA-Z0-9' '[\n*]'
2420 Convert each sequence of repeated newlines to a single newline:
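The three cases above can be tried on small inputs as follows. This is
a sketch: the sample data is invented, and the alphanumeric set is
written as plain ranges.

@example
# Remove all zero bytes.
printf 'a\000b\000c\n' | tr -d '\000'
# prints abc

# Put each word on its own line.
printf 'one, two;  three\n' | tr -cs 'a-zA-Z0-9' '[\n*]'
# prints one, two, and three on separate lines

# Squeeze each run of newlines into a single newline.
printf 'x\n\n\ny\n' | tr -s '\n'
# prints x and y on consecutive lines
@end example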
2429 @node Warnings in tr
2430 @subsection Warning messages
2432 @vindex POSIXLY_CORRECT
2433 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
2434 following warning and error messages, for strict compliance with
2435 POSIX.2. Otherwise, the following diagnostics are issued:
2440 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
2441 is not, and @var{set2} is given, GNU @code{tr} by default prints
2442 a usage message and exits, because @var{set2} would not be used.
2443 The POSIX specification says that @var{set2} must be ignored in
2444 this case. Silently ignoring arguments is a bad idea.
2447 When an ambiguous octal escape is given. For example, @samp{\400}
2448 is actually @samp{\40} followed by the digit @samp{0}, because the
2449 value 400 octal does not fit into a single byte.
2453 GNU @code{tr} does not provide complete BSD or System V compatibility.
2454 For example, it is impossible to disable interpretation of the POSIX
2455 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
2456 @code{tr} does not delete zero bytes automatically, unlike traditional
2457 Unix versions, which provide no way to preserve zero bytes.
2460 @node expand invocation
2461 @section @code{expand}: Convert tabs to spaces
2464 @cindex tabs to spaces, converting
2465 @cindex converting tabs to spaces
2467 @code{expand} writes the contents of each given @var{file}, or standard
2468 input if none are given or for a @var{file} of @samp{-}, to standard
2469 output, with tab characters converted to the appropriate number of
2473 expand [@var{option}]@dots{} [@var{file}]@dots{}
2476 By default, @code{expand} converts all tabs to spaces. It preserves
2477 backspace characters in the output; they decrement the column count for
2478 tab calculations. The default action is equivalent to @samp{-8} (set
2479 tabs every 8 columns).
2481 The program accepts the following options. Also see @ref{Common options}.
2485 @item -@var{tab1}[,@var{tab2}]@dots{}
2486 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2487 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2491 @cindex tabstops, setting
2492 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2493 (default is 8). Otherwise, set the tabs at columns @var{tab1},
2494 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
2495 last tabstop given with single spaces. If the tabstops are specified
2496 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2497 blanks as well as by commas.
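For instance, the following sketch expands a made-up line with
different tab settings; @code{tr} is used only to make the resulting
spaces visible as dots.

@example
# Default: tab stops every 8 columns.
printf 'a\tb\n' | expand | tr ' ' .
# prints a.......b

# One value: tab stops every 4 columns.
printf 'a\tb\n' | expand -t 4 | tr ' ' .
# prints a...b

# A list: tab stops at columns 2 and 6 (numbered from 0).
printf 'a\tb\tc\n' | expand -t 2,6 | tr ' ' .
# prints a.b...c
@end example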
2503 @cindex initial tabs, converting
2504 Only convert initial tabs (those that precede every character that is
2505 neither a space nor a tab) on each line to spaces.
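A small sketch of the difference, with an invented input line in which
the first tab is leading and the second is not. In the output,
@code{tr} shows each space as a dot and each surviving tab as
@samp{>}.

@example
# Without -i, both tabs become spaces.
printf '\tx\ty\n' | expand | tr ' \t' '.>'
# prints ........x.......y

# With -i, only the leading tab is converted.
printf '\tx\ty\n' | expand -i | tr ' \t' '.>'
# prints ........x>y
@end example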
2510 @node unexpand invocation
2511 @section @code{unexpand}: Convert spaces to tabs
2515 @code{unexpand} writes the contents of each given @var{file}, or
2516 standard input if none are given or for a @var{file} of @samp{-}, to
2517 standard output, with strings of two or more space or tab characters
2518 converted to as many tabs as possible followed by as many spaces as are
2522 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
2525 By default, @code{unexpand} converts only initial spaces and tabs (those
2526 that precede every character that is neither a space nor a tab) on each line. It
2527 preserves backspace characters in the output; they decrement the column
2528 count for tab calculations. By default, tabs are set at every 8th
2531 The program accepts the following options. Also see @ref{Common options}.
2535 @item -@var{tab1}[,@var{tab2}]@dots{}
2536 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2537 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2541 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2542 instead of the default 8. Otherwise, set the tabs at columns
2543 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
2544 tabs beyond the tabstops given unchanged. If the tabstops are specified
2545 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2546 blanks as well as by commas. This option implies the @samp{-a} option.
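The following sketch shows the default behavior next to the tab-stop
option. The input line is made up, and @code{tr} marks each tab in the
output as @samp{>} and each space as a dot. The outputs assume GNU
@code{unexpand}.

@example
# Default: only the eight leading spaces become a tab.
printf '        a       b\n' | unexpand | tr ' \t' '.>'
# prints >a.......b

# Giving tab stops implies -a, so the inner run of spaces
# that ends at a tab stop is converted too.
printf '        a       b\n' | unexpand -t 8 | tr ' \t' '.>'
# prints >a>b
@end example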
2552 Convert all strings of two or more spaces or tabs, not just initial
2560 @node Opening the software toolbox
2561 @chapter Opening the software toolbox
2563 This chapter originally appeared in @cite{Linux Journal}, volume 1,
2564 number 2, in the @cite{What's GNU?} column. It was written by Arnold
2568 * Toolbox introduction::
2570 * The @code{who} command::
2571 * The @code{cut} command::
2572 * The @code{sort} command::
2573 * The @code{uniq} command::
2574 * Putting the tools together::
2578 @node Toolbox introduction
2579 @unnumberedsec Toolbox introduction
2581 This month's column is only peripherally related to the GNU Project, in
2582 that it describes a number of the GNU tools on your Linux system and how they
2583 might be used. What it's really about is the ``Software Tools'' philosophy
2584 of program development and usage.
2586 The software tools philosophy was an important and integral concept
2587 in the initial design and development of Unix (of which Linux and GNU are
2588 essentially clones). Unfortunately, in the modern day press of
2589 Internetworking and flashy GUIs, it seems to have fallen by the
2590 wayside. This is a shame, since it provides a powerful mental model
2591 for solving many kinds of problems.
2593 Many people carry a Swiss Army knife around in their pants pockets (or
2594 purse). A Swiss Army knife is a handy tool to have: it has several knife
2595 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
2596 a number of other things on it. For the everyday, small miscellaneous jobs
2597 where you need a simple, general purpose tool, it's just the thing.
2599 On the other hand, an experienced carpenter doesn't build a house using
2600 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
2601 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
2602 exactly when and where to use each tool; you won't catch him hammering nails
2603 with the handle of his screwdriver.
2605 The Unix developers at Bell Labs were all professional programmers and trained
2606 computer scientists. They had found that while a one-size-fits-all program
2607 might appeal to a user because there's only one program to use, in practice
2615 difficult to maintain and
2619 difficult to extend to meet new situations.
2622 Instead, they felt that programs should be specialized tools. In short, each
2623 program ``should do one thing well.'' No more and no less. Such programs are
2624 simpler to design, write, and get right---they only do one thing.
2626 Furthermore, they found that with the right machinery for hooking programs
2627 together, the whole was greater than the sum of the parts. By combining
2628 several special purpose programs, you could accomplish a specific task
2629 that none of the programs was designed for, and accomplish it much more
2630 quickly and easily than if you had to write a special purpose program.
2631 We will see some (classic) examples of this further on in the column.
2632 (An important additional point was that, if necessary, you should take a
2633 detour and build any software tools you need first, if you don't already
2634 have something appropriate in the toolbox.)
2636 @node I/O redirection
2637 @unnumberedsec I/O redirection
2639 Hopefully, you are familiar with the basics of I/O redirection in the
2640 shell, in particular the concepts of ``standard input,'' ``standard output,''
2641 and ``standard error''. Briefly, ``standard input'' is a data source, where
2642 data comes from. A program should not need to either know or care if the
2643 data source is a disk file, a keyboard, a magnetic tape, or even a punched
2644 card reader. Similarly, ``standard output'' is a data sink, where data goes
2645 to. The program should neither know nor care where this might be.
2646 Programs that only read their standard input, do something to the data,
2647 and then send it on, are called ``filters'', by analogy to filters in a
2650 With the Unix shell, it's very easy to set up data pipelines:
2653 program_to_create_data | filter1 | .... | filterN > final.pretty.data
2656 We start out by creating the raw data; each filter applies some successive
2657 transformation to the data, until by the time it comes out of the pipeline,
2658 it is in the desired form.
2660 This is fine and good for standard input and standard output. Where does the
2661 standard error come into play? Well, think about @code{filter1} in
2662 the pipeline above. What happens if it encounters an error in the data it
2663 sees? If it writes an error message to standard output, it will just
2664 disappear down the pipeline into @code{filter2}'s input, and the
2665 user will probably never see it. So programs need a place where they can send
2666 error messages so that the user will notice them. This is standard error,
2667 and it is usually connected to your console or window, even if you have
2668 redirected standard output of your program away from your screen.
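A minimal sketch of why this matters. Here a hypothetical first stage,
played by a small @code{sh -c} command, writes one line of data to
standard output and one diagnostic to standard error; only the data
travels down the pipe.

@example
# The diagnostic bypasses the pipe and goes to the terminal;
# only the data line reaches tr.
sh -c 'echo data; echo "bad record" >&2' | tr a-z A-Z
# standard output carries DATA; "bad record" appears unchanged

# Standard error can be redirected independently of the pipeline.
sh -c 'echo data; echo "bad record" >&2' 2>/dev/null | tr a-z A-Z
# prints DATA
@end example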
2670 For filter programs to work together, the format of the data has to be
2671 agreed upon. The most straightforward and easiest format to use is simply
2672 lines of text. Unix data files are generally just streams of bytes, with
2673 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
2674 conventionally called a ``newline'' in the Unix literature. (This is
2675 @code{'\n'} if you're a C programmer.) This is the format used by all
2676 the traditional filtering programs. (Many earlier operating systems
2677 had elaborate facilities and special purpose programs for managing
2678 binary data. Unix has always shied away from such things, under the
2679 philosophy that it's easiest to simply be able to view and edit your
2680 data with a text editor.)
2682 OK, enough introduction. Let's take a look at some of the tools, and then
2683 we'll see how to hook them together in interesting ways. In the following
2684 discussion, we will only present those command line options that interest
2685 us. As you should always do, double check your system documentation
2688 @node The @code{who} command
2689 @unnumberedsec The @code{who} command
2691 The first program is the @code{who} command. By itself, it generates a
2692 list of the users who are currently logged in. Although I'm writing
2693 this on a single-user system, we'll pretend that several people are
2698 arnold   console Jan 22 19:57
2699 miriam   ttyp0   Jan 23 14:19(:0.0)
2700 bill     ttyp1   Jan 21 09:32(:0.0)
2701 arnold   ttyp2   Jan 23 20:48(:0.0)
2704 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
2705 There are three people logged in, and I am logged in twice. On traditional
2706 Unix systems, user names are never more than eight characters long. This
2707 little bit of trivia will be useful later. The output of @code{who} is nice,
2708 but the data is not all that exciting.
2710 @node The @code{cut} command
2711 @unnumberedsec The @code{cut} command
2713 The next program we'll look at is the @code{cut} command. This program
2714 cuts out columns or fields of input data. For example, we can tell it
2715 to print just the login name and full name from the @file{/etc/passwd}
2716 file. The @file{/etc/passwd} file has seven fields, separated by
2720 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
2723 To get the first and fifth fields, we would use @code{cut} like this:
2726 $ cut -d: -f1,5 /etc/passwd
2729 arnold:Arnold D. Robbins
2730 miriam:Miriam A. Robbins
2734 With the @samp{-c} option, @code{cut} will cut out specific characters
2735 (i.e., columns) in the input lines. This command looks like it might be
2736 useful for data filtering.
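As a sketch, both modes can be tried on a single line of input instead
of the real file. The line is the sample entry shown above.

@example
# Fields 1 and 5, using : as the delimiter.
printf 'arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh\n' |
  cut -d: -f1,5
# prints arnold:Arnold D. Robbins

# Characters 1 through 6 of each line.
printf 'arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh\n' |
  cut -c1-6
# prints arnold
@end example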
2739 @node The @code{sort} command
2740 @unnumberedsec The @code{sort} command
2742 Next we'll look at the @code{sort} command. This is one of the most
2743 powerful commands on a Unix-style system; one that you will often find
2744 yourself using when setting up fancy data plumbing. The @code{sort}
2745 command reads and sorts each file named on the command line. It then
2746 merges the sorted data and writes it to standard output. It will read
2747 standard input if no files are given on the command line (thus
2748 making it into a filter). The sort is based on the machine collating
2749 sequence (@sc{ASCII}) or based on user-supplied ordering criteria.
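A short sketch with invented data. Setting @code{LC_ALL=C} is an
assumption made here so that the comparison is by byte value on any
system.

@example
# Sort standard input, acting as a filter.
printf 'pear\nApple\nfig\n' | LC_ALL=C sort
# prints Apple, fig, pear (uppercase letters sort first in ASCII)
@end example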
2752 @node The @code{uniq} command
2753 @unnumberedsec The @code{uniq} command
2755 Finally (at least for now), we'll look at the @code{uniq} program. When
2756 sorting data, you will often end up with duplicate lines, lines that
2757 are identical. Usually, all you need is one instance of each line.
2758 This is where @code{uniq} comes in. The @code{uniq} program reads its
2759 standard input, which it expects to be sorted. It only prints out one
2760 copy of each duplicated line. It does have several options. Later on,
2761 we'll use the @samp{-c} option, which prints each unique line, preceded
2762 by a count of the number of times that line occurred in the input.
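The following sketch, using made-up input, shows why the input should
be sorted and what @samp{-c} adds. The exact width of the count column
may vary between implementations.

@example
# Only adjacent duplicates collapse, so unsorted input
# still yields ann twice: ann, bob, ann.
printf 'ann\nann\nbob\nann\n' | uniq

# Sort first, then count occurrences of each line:
# 3 for ann, 1 for bob.
printf 'ann\nann\nbob\nann\n' | LC_ALL=C sort | uniq -c
@end example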
2765 @node Putting the tools together
2766 @unnumberedsec Putting the tools together
2768 Now, let's suppose this is a large BBS system with dozens of users
2769 logged in. The management wants the SysOp to write a program that will
2770 generate a sorted list of logged in users. Furthermore, even if a user
2771 is logged in multiple times, his or her name should only show up in the
2774 The SysOp could sit down with the system documentation and write a C
2775 program that did this. It would take perhaps a couple of hundred lines
2776 of code and about two hours to write it, test it, and debug it.
2777 However, knowing the software toolbox, the SysOp can instead start out
2778 by generating just a list of logged on users:
2788 Next, sort the list:
2791 $ who | cut -c1-8 | sort
2798 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
2801 $ who | cut -c1-8 | sort | uniq
2807 The @code{sort} command actually has a @samp{-u} option that does what
2808 @code{uniq} does. However, @code{uniq} has other uses for which one
2809 cannot substitute @samp{sort -u}.
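To experiment with these stages without a multi-user system, the
pipeline can be fed canned data. In this sketch the @code{printf}
command stands in for @code{who}; its output is modeled on the sample
listing shown earlier and is an assumption, not real output.

@example
# printf plays the role of who; names are padded to 8 columns.
printf '%-8s %s\n' arnold console miriam ttyp0 bill ttyp1 arnold ttyp2 |
  cut -c1-8 | LC_ALL=C sort | uniq
# prints arnold, bill, miriam (each padded to 8 characters)

# sort -u gives the same list in one step.
printf '%-8s %s\n' arnold console miriam ttyp0 bill ttyp1 arnold ttyp2 |
  cut -c1-8 | LC_ALL=C sort -u
@end example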
2811 The SysOp puts this pipeline into a shell script, and makes it available for
2812 all the users on the system:
2815 # cat > /usr/local/bin/listusers
2816 who | cut -c1-8 | sort | uniq
2818 # chmod +x /usr/local/bin/listusers
2821 There are four major points to note here. First, with just four
2822 programs, on one command line, the SysOp was able to save about two
2823 hours' worth of work. Furthermore, the shell pipeline is just about as
2824 efficient as the C program would be, and it is much more efficient in
2825 terms of programmer time. People time is much more expensive than
2826 computer time, and in our modern ``there's never enough time to do
2827 everything'' society, saving two hours of programmer time is no mean
2830 Second, it is also important to emphasize that with the
2831 @emph{combination} of the tools, it is possible to do a special
2832 purpose job never imagined by the authors of the individual programs.
2834 Third, it is also valuable to build up your pipeline in stages, as we did here.
2835 This allows you to view the data at each stage in the pipeline, which helps
2836 you acquire the confidence that you are indeed using these tools correctly.
2838 Finally, by bundling the pipeline in a shell script, other users can use
2839 your command, without having to remember the fancy plumbing you set up for
2840 them. In terms of how you run them, shell scripts and compiled programs are
2843 After the previous warm-up exercise, we'll look at two additional, more
2844 complicated pipelines. For them, we need to introduce two more tools.
2846 The first is the @code{tr} command, which stands for ``transliterate.''
2847 The @code{tr} command works on a character-by-character basis, changing
2848 characters. Normally it is used for things like mapping upper case to
2852 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
2853 this example has mixed case!
2856 There are several options of interest:
2860 work on the complement of the listed characters, i.e.,
2861 operations apply to characters not in the given set
2864 delete characters in the first set from the output
2867 squeeze repeated characters in the output into just one character.
2870 We will be using all three options in a moment.
2872 The other command we'll look at is @code{comm}. The @code{comm}
2873 command takes two sorted files as input, and prints out the
2874 files' lines in three columns. The output columns are the data lines
2875 unique to the first file, the data lines unique to the second file, and
2876 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
2877 @samp{-3} command line options omit the respective columns. (This is
2878 non-intuitive and takes a little getting used to.) For example:
2900 The single dash as a filename tells @code{comm} to read standard input
2901 instead of a regular file.
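A sketch of @code{comm} with two small sorted files. The file names
and contents are invented, and the files live in a scratch directory
created with @code{mktemp -d}.

@example
dir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$dir/first"
printf 'banana\ncherry\ndate\n' > "$dir/second"

# Three columns: only in first, only in second, in both.
comm "$dir/first" "$dir/second"

# -12 leaves just the third column: lines common to both.
comm -12 "$dir/first" "$dir/second"
# prints banana and cherry

# A dash reads the first file from standard input;
# -23 leaves only the lines unique to it.
comm -23 - "$dir/second" < "$dir/first"
# prints apple

rm -r "$dir"
@end example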
2903 Now we're ready to build a fancy pipeline. The first application is a word
2904 frequency counter. This helps an author determine if he or she is over-using
2907 The first step is to change the case of all the letters in our input file
2908 to one case. ``The'' and ``the'' are the same word when doing counting.
2911 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
2914 The next step is to get rid of punctuation. Quoted words and unquoted words
2915 should be treated identically; it's easiest to just get the punctuation out of
2919 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
2922 The second @code{tr} command operates on the complement of the listed
2923 characters, which are all the letters, the digits, the underscore, and
2924 the blank. The @samp{\012} represents the newline character; it has to
2925 be left alone. (The ASCII TAB character should also be included for
2926 good measure in a production script.)
2928 At this point, we have data consisting of words separated by blank space.
2929 The words only contain alphanumeric characters (and the underscore). The
2930 next step is to break the data apart so that we have one word per line. This
2931 makes the counting operation much easier, as we will see shortly.
2934 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2935 > tr -s '[ ]' '\012' | ...
2938 This command turns blanks into newlines. The @samp{-s} option squeezes
2939 multiple newline characters in the output into just one. This helps us
2940 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
2941 This is what the shell prints when it notices you haven't finished
2942 typing in all of a command.)
2944 We now have data consisting of one word per line, no punctuation, all one
2945 case. We're ready to count each word:
2948 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2949 > tr -s '[ ]' '\012' | sort | uniq -c | ...
2952 At this point, the data might look something like this:
2965 The output is sorted by word, not by count! What we want is the most
2966 frequently used words first. Fortunately, this is easy to accomplish,
2967 with the help of two more @code{sort} options:
2971 do a numeric sort, not an ASCII one
2974 reverse the order of the sort
2977 The final pipeline looks like this:
2980 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
2981 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
2990 Whew! That's a lot to digest. Yet, the same principles apply. With six
2991 commands, on two lines (really one long one split for convenience), we've
2992 created a program that does something interesting and useful, in much
2993 less time than we could have written a C program to do the same thing.
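To try the pipeline without a copy of @file{whats.gnu}, it can be run
on a line of canned text. This is a sketch: the sample sentence is
invented, and the sets are written without the enclosing brackets used
above, which GNU @code{tr} does not need.

@example
printf 'The cat and the dog and THE bird.\n' |
  tr A-Z a-z | tr -cd 'A-Za-z0-9_ \012' |
  tr -s ' ' '\012' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr
# the most frequent word, the (3), comes first, then and (2)
@end example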
2995 A minor modification to the above pipeline can give us a simple spelling
2996 checker! To determine if you've spelled a word correctly, all you have to
2997 do is look it up in a dictionary. If it is not there, then chances are
2998 that your spelling is incorrect. So, we need a dictionary. If you
2999 have the Slackware Linux distribution, you have the file
3000 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
3003 Now, how to compare our file with the dictionary? As before, we generate
3004 a sorted list of words, one per line:
3007 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3008 > tr -s '[ ]' '\012' | sort -u | ...
3011 Now, all we need is a list of words that are @emph{not} in the
3012 dictionary. Here is where the @code{comm} command comes in.
3015 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3016 > tr -s '[ ]' '\012' | sort -u |
3017 > comm -23 - /usr/lib/ispell/ispell.words
3020 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3021 dictionary (the second file), and lines that are in both files. Lines
3022 only in the first file (standard input, our stream of words), are
3023 words that are not in the dictionary. These are likely candidates for
3024 spelling errors. This pipeline was the first cut at a production
3025 spelling checker on Unix.
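The same idea can be sketched with a toy dictionary. Everything here
is invented for illustration: the word list, the misspelled sample
text, and the scratch file created with @code{mktemp}.

@example
dict=$(mktemp)
# A tiny dictionary, already sorted by byte value.
printf 'a\nhas\nmistake\nsentence\nthis\n' > "$dict"

# Words from the text that are not in the dictionary.
printf 'This sentance has a mistake.\n' |
  tr A-Z a-z | tr -cd 'a-z0-9_ \012' |
  tr -s ' ' '\012' | LC_ALL=C sort -u |
  LC_ALL=C comm -23 - "$dict"
# prints sentance

rm "$dict"
@end example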
3027 There are some other tools that deserve brief mention.
3031 search files for text that matches a regular expression
3034 like @code{grep}, but with more powerful regular expressions
3037 count lines, words, characters
3040 a T-fitting for data pipes, copies data to files and to standard output
3043 the stream editor, an advanced tool
3046 a data manipulation language, another advanced tool
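As a quick sketch of how three of these fit into a pipeline (the input
lines and the scratch file are made up):

@example
log=$(mktemp)

# grep keeps matching lines, tee saves a copy, wc counts them.
printf 'error: disk\nok\nerror: net\n' |
  grep '^error' | tee "$log" | wc -l
# prints 2 (some systems pad the number with spaces)

# The saved copy holds the two matching lines.
cat "$log"
rm "$log"
@end example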
3049 The software tools philosophy also espoused the following bit of
3050 advice: ``Let someone else do the hard part.'' This means, take
3051 something that gives you most of what you need, and then massage it the
3052 rest of the way until it's in the form that you want.
3058 Each program should do one thing well. No more, no less.
3061 Combining programs with appropriate plumbing leads to results where
3062 the whole is greater than the sum of the parts. It also leads to novel
3063 uses of programs that the authors might never have imagined.
3066 Programs should never print extraneous header or trailer data, since these
3067 could get sent on down a pipeline. (A point we didn't mention earlier.)
3070 Let someone else do the hard part.
3073 Know your toolbox! Use each program appropriately. If you don't have an
3074 appropriate tool, build one.
3077 As of this writing, all the programs we've discussed are available via
3078 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
3079 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
3080 current when this column was written. Check the nearest GNU archive for
3081 the current version.}
3083 None of what I have presented in this column is new. The Software Tools
3084 philosophy was first introduced in the book @cite{Software Tools},
3085 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
3086 0-201-03669-X). This book showed how to write and use software
3087 tools. It was written in 1976, using a preprocessor for FORTRAN named
3088 @code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
3089 as it is now; FORTRAN was. The last chapter presented a @code{ratfor}
3090 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
3091 awful lot like C; if you know C, you won't have any problem following
3094 In 1981, the book was updated and made available as @cite{Software
3095 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
3096 remain in print, and are well worth reading if you're a programmer.
3097 They certainly made a major change in how I view programming.
3099 Initially, the programs in both books were available (on 9-track tape)
3100 from Addison-Wesley. Unfortunately, this is no longer the case,
3101 although you might be able to find copies floating around the Internet.
3102 For a number of years, there was an active Software Tools Users Group,
3103 whose members had ported the original @code{ratfor} programs to essentially
3104 every computer system with a FORTRAN compiler. The popularity of the
3105 group waned in the middle '80s as Unix began to spread beyond universities.
3107 With the current proliferation of GNU code and other clones of Unix programs,
3108 these programs now receive little attention; modern C versions are
3109 much more efficient and do more than these programs do. Nevertheless, as
3110 exposition of good programming style, and evangelism for a still-valuable
3111 philosophy, these books are unparalleled, and I recommend them highly.
3113 Acknowledgement: I would like to express my gratitude to Brian Kernighan
3114 of Bell Labs, the original Software Toolsmith, for reviewing this column.
3126 @c texinfo-column-for-description: 32