3 @setfilename textutils.info
4 @settitle GNU text utilities
12 @c Put everything in one index (arbitrarily chosen to be the concept index).
22 * Text utilities: (textutils). GNU text utilities.
23 * cat: (textutils)cat invocation. Concatenate and write files.
24 * cksum: (textutils)cksum invocation. Print @sc{POSIX} CRC checksum.
25 * comm: (textutils)comm invocation. Compare sorted files by line.
26 * csplit: (textutils)csplit invocation. Split by context.
27 * cut: (textutils)cut invocation. Print selected parts of lines.
28 * expand: (textutils)expand invocation. Convert tabs to spaces.
29 * fmt: (textutils)fmt invocation. Reformat paragraph text.
30 * fold: (textutils)fold invocation. Wrap long input lines.
31 * head: (textutils)head invocation. Output the first part of files.
32 * join: (textutils)join invocation. Join lines on a common field.
33 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
34 * nl: (textutils)nl invocation. Number lines and write files.
35 * od: (textutils)od invocation. Dump files in octal, etc.
36 * paste: (textutils)paste invocation. Merge lines of files.
37 * pr: (textutils)pr invocation. Paginate or columnate files.
38 * sort: (textutils)sort invocation. Sort text files.
39 * split: (textutils)split invocation. Split into fixed-size pieces.
40 * sum: (textutils)sum invocation. Print traditional checksum.
41 * tac: (textutils)tac invocation. Reverse files.
42 * tail: (textutils)tail invocation. Output the last part of files.
43 * tr: (textutils)tr invocation. Translate characters.
44 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
45 * uniq: (textutils)uniq invocation. Uniqify files.
46 * wc: (textutils)wc invocation. Byte, word, and line counts.
52 This file documents the GNU text utilities.
54 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
56 Permission is granted to make and distribute verbatim copies of
57 this manual provided the copyright notice and this permission notice
58 are preserved on all copies.
61 Permission is granted to process this file through TeX and print the
62 results, provided the printed document carries copying permission
63 notice identical to this one except for the removal of this paragraph
64 (this paragraph not being relevant to the printed manual).
67 Permission is granted to copy and distribute modified versions of this
68 manual under the conditions for verbatim copying, provided that the entire
69 resulting derived work is distributed under the terms of a permission
70 notice identical to this one.
72 Permission is granted to copy and distribute translations of this manual
73 into another language, under the above conditions for modified versions,
74 except that this permission notice may be stated in a translation approved
79 @title GNU @code{textutils}
80 @subtitle A set of text utilities
81 @subtitle for version @value{VERSION}, @value{UPDATED}
82 @author David MacKenzie et al.
85 @vskip 0pt plus 1filll
86 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
88 Permission is granted to make and distribute verbatim copies of
89 this manual provided the copyright notice and this permission notice
90 are preserved on all copies.
92 Permission is granted to copy and distribute modified versions of this
93 manual under the conditions for verbatim copying, provided that the entire
94 resulting derived work is distributed under the terms of a permission
95 notice identical to this one.
97 Permission is granted to copy and distribute translations of this manual
98 into another language, under the above conditions for modified versions,
99 except that this permission notice may be stated in a translation approved
106 @top GNU text utilities
108 @cindex text utilities
109 @cindex utilities for text handling
111 This manual documents version @value{VERSION} of the GNU text utilities.
114 * Introduction:: Caveats, overview, and authors.
115 * Common options:: Common options.
116 * Output of entire files:: cat tac nl od
117 * Formatting file contents:: fmt pr fold
118 * Output of parts of files:: head tail split csplit
119 * Summarizing files:: wc sum cksum md5sum
120 * Operating on sorted files:: sort uniq comm
121 * Operating on fields within a line:: cut paste join
122 * Operating on characters:: tr expand unexpand
123 * Opening the software toolbox:: The software tools philosophy.
124 * Index:: General index.
130 @chapter Introduction
134 This manual is incomplete: No attempt is made to explain basic concepts
135 in a way suitable for novices. Thus, if you are interested, please get
136 involved in improving this manual. The entire GNU community will
140 The GNU text utilities are mostly compatible with the @sc{POSIX.2} standard.
142 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
143 @c sh-utils.texi too -- so be sure to keep them consistent.
144 @cindex bugs, reporting
145 Please report bugs to @samp{textutils-bugs@@gnu.ai.mit.edu}. Remember
146 to include the version number, machine architecture, input files, and
147 any other information needed to reproduce the bug: your input, what you
148 expected, what you got, and why it is wrong. Diffs are welcome, but
149 please include a description of the problem as well, since this is
150 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
152 This manual was originally derived from the Unix man pages in the
153 distribution, which were written by David MacKenzie and updated by Jim
154 Meyering. What you are reading now is the authoritative documentation
155 for these utilities; the man pages are no longer being maintained.
156 The original @code{fmt} man page was written by Ross Paterson.
157 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
158 Karl Berry did the indexing, some reorganization, and editing of the results.
159 Richard Stallman contributed his usual invaluable insights to the
164 @chapter Common options
166 @cindex common options
Certain options are available in all these programs. Rather than
repeating identical descriptions for each program, this manual
describes them here. (In fact, every GNU program accepts (or should accept)
173 A few of these programs take arbitrary strings as arguments. In those
174 cases, @samp{--help} and @samp{--version} are taken as these options
175 only if there is one and exactly one command line argument.
182 Print a usage message listing all available options, then exit successfully.
186 @cindex version number, finding
187 Print the version number, then exit successfully.
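As a rough illustration, assuming GNU versions of these programs are
installed and found first on the search path, the following sketch checks
that a few of them accept both options and exit successfully. The choice
of @code{cat}, @code{head}, and @code{fold} and the variable name
@code{ok_count} are only for illustration.

```shell
# Count how many of the sampled programs accept both common options.
ok_count=0
for prog in cat head fold; do
    # Each option prints to standard output and then exits with status 0.
    if "$prog" --version >/dev/null && "$prog" --help >/dev/null; then
        ok_count=$((ok_count + 1))
    fi
done
echo "$ok_count of 3 programs accepted both options"
```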
192 @node Output of entire files
193 @chapter Output of entire files
195 @cindex output of entire files
196 @cindex entire files, output of
198 These commands read and write entire files, possibly transforming them
202 * cat invocation:: Concatenate and write files.
203 * tac invocation:: Concatenate and write files in reverse.
204 * nl invocation:: Number lines and write files.
205 * od invocation:: Write files in octal or other formats.
209 @section @code{cat}: Concatenate and write files
212 @cindex concatenate and write files
213 @cindex copying files
215 @code{cat} copies each @var{file} (@samp{-} means standard input), or
216 standard input if none are given, to standard output. Synopsis:
219 cat [@var{option}] [@var{file}]@dots{}
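A short sketch of typical use, assuming a GNU @code{cat} and a POSIX
shell. The helper function @code{input} and its sample text are
hypothetical; the options used (@samp{-A}, @samp{-s}, @samp{-n}) are
described in the list that follows.

```shell
# Hypothetical sample input: a tab, two adjacent blank lines, more text.
input() { printf 'a\tb\n\n\nc\n'; }

# -A (same as -vET) marks each line end with '$' and shows TAB as '^I'.
visible=$(input | cat -A)

# -s squeezes the adjacent blank lines into one; -n numbers every line.
numbered=$(input | cat -s -n)

printf '%s\n' "$visible" "$numbered"
```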
222 The program accepts the following options. Also see @ref{Common options}.
230 Equivalent to @samp{-vET}.
233 @itemx --number-nonblank
235 @opindex --number-nonblank
236 Number all nonblank output lines, starting with 1.
240 Equivalent to @samp{-vE}.
246 Display a @samp{$} after the end of each line.
252 Number all output lines, starting with 1.
255 @itemx --squeeze-blank
257 @opindex --squeeze-blank
258 @cindex squeezing blank lines
259 Replace multiple adjacent blank lines with a single blank line.
263 Equivalent to @samp{-vT}.
269 Display @key{TAB} characters as @samp{^I}.
273 Ignored; for Unix compatibility.
276 @itemx --show-nonprinting
278 @opindex --show-nonprinting
279 Display control characters except for @key{LFD} and @key{TAB} using
280 @samp{^} notation and precede characters that have the high bit set
287 @section @code{tac}: Concatenate and write files in reverse
290 @cindex reversing files
292 @code{tac} copies each @var{file} (@samp{-} means standard input), or
293 standard input if none are given, to standard output, reversing the
294 records (lines by default) in each separately. Synopsis:
297 tac [@var{option}]@dots{} [@var{file}]@dots{}
300 @dfn{Records} are separated by instances of a string (newline by
301 default). By default, this separator string is attached to the end of
302 the record that it follows in the file.
304 The program accepts the following options. Also see @ref{Common options}.
312 The separator is attached to the beginning of the record that it
313 precedes in the file.
319 Treat the separator string as a regular expression.
321 @item -s @var{separator}
322 @itemx --separator=@var{separator}
325 Use @var{separator} as the record separator, instead of newline.
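The record and separator rules above can be tried out with a sketch like
the following, which assumes a GNU @code{tac}. The sample strings and
variable names are hypothetical.

```shell
# Default: newline-separated records (lines) come out last first.
by_line=$(printf 'one\ntwo\nthree\n' | tac)

# -s: use a comma as the separator; each comma stays attached to the
# end of the record it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)

# -b: attach each separator to the record it precedes instead.
before=$(printf ',a,b,c' | tac -b -s ,)

printf '%s\n' "$by_line" "$by_comma" "$before"
```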
331 @section @code{nl}: Number lines and write files
334 @cindex numbering lines
335 @cindex line numbering
337 @code{nl} writes each @var{file} (@samp{-} means standard input), or
338 standard input if none are given, to standard output, with line numbers
339 added to some or all of the lines. Synopsis:
342 nl [@var{option}]@dots{} [@var{file}]@dots{}
345 @cindex logical pages, numbering on
346 @code{nl} decomposes its input into (logical) pages; by default, the
347 line number is reset to 1 at the top of each logical page. @code{nl}
348 treats all of the input files as a single document; it does not reset
349 line numbers or logical pages between files.
351 @cindex headers, numbering
352 @cindex body, numbering
353 @cindex footers, numbering
354 A logical page consists of three sections: header, body, and footer.
355 Any of the sections can be empty. Each can be numbered in a different
356 style from the others.
358 The beginnings of the sections of logical pages are indicated in the
359 input file by a line containing exactly one of these delimiter strings:
370 The two characters from which these strings are made can be changed from
371 @samp{\} and @samp{:} via options (see below), but the pattern and
372 length of each string cannot be changed.
374 A section delimiter is replaced by an empty line on output. Any text
375 that comes before the first section delimiter string in the input file
376 is considered to be part of a body section, so @code{nl} treats a
377 file that contains no section delimiters as a single body section.
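The effect of section delimiters on numbering can be seen with a sketch
like the one below. It assumes a GNU @code{nl} and the default delimiter
characters; the helper @code{pages} and its input are hypothetical. The
@samp{-p} option used for comparison is described in the option list.

```shell
# Hypothetical input: one body line, a header delimiter (\:\:\:) that
# opens a new logical page, a body delimiter (\:\:), another body line.
pages() { printf '%s\n' 'first' '\:\:\:' '\:\:' 'second'; }

# Keep only the line numbers; delimiter lines become empty output lines.
numbers() { awk 'NF { print $1 }' | tr '\n' ' '; }

# By default, numbering restarts at the top of the new logical page.
restarted=$(pages | nl | numbers)

# With -p, numbering carries on across logical pages.
continued=$(pages | nl -p | numbers)

echo "default: $restarted"
echo "with -p: $continued"
```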
379 The program accepts the following options. Also see @ref{Common options}.
384 @itemx --body-numbering=@var{style}
386 @opindex --body-numbering
387 Select the numbering style for lines in the body section of each
388 logical page. When a line is not numbered, the current line number
389 is not incremented, but the line number separator character is still
390 prepended to the line. The styles are:
396 number only nonempty lines (default for body),
398 do not number lines (default for header and footer),
400 number only lines that contain a match for @var{regexp}.
404 @itemx --section-delimiter=@var{cd}
406 @opindex --section-delimiter
407 @cindex section delimiters of pages
408 Set the section delimiter characters to @var{cd}; default is
409 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
410 (Remember to protect @samp{\} or other metacharacters from shell
411 expansion with quotes or extra backslashes.)
414 @itemx --footer-numbering=@var{style}
416 @opindex --footer-numbering
417 Analogous to @samp{--body-numbering}.
420 @itemx --header-numbering=@var{style}
422 @opindex --header-numbering
423 Analogous to @samp{--body-numbering}.
425 @item -i @var{number}
426 @itemx --page-increment=@var{number}
428 @opindex --page-increment
429 Increment line numbers by @var{number} (default 1).
431 @item -l @var{number}
432 @itemx --join-blank-lines=@var{number}
434 @opindex --join-blank-lines
435 @cindex empty lines, numbering
436 @cindex blank lines, numbering
437 Consider @var{number} (default 1) consecutive empty lines to be one
438 logical line for numbering, and only number the last one. Where fewer
439 than @var{number} consecutive empty lines occur, do not number them.
440 An empty line is one that contains no characters, not even spaces
443 @item -n @var{format}
444 @itemx --number-format=@var{format}
446 @opindex --number-format
447 Select the line numbering format (default is @code{rn}):
451 @opindex ln @r{format for @code{nl}}
452 left justified, no leading zeros;
454 @opindex rn @r{format for @code{nl}}
455 right justified, no leading zeros;
457 @opindex rz @r{format for @code{nl}}
458 right justified, leading zeros.
464 @opindex --no-renumber
465 Do not reset the line number at the start of a logical page.
467 @item -s @var{string}
468 @itemx --number-separator=@var{string}
470 @opindex --number-separator
471 Separate the line number from the text line in the output with
472 @var{string} (default is @key{TAB}).
474 @item -v @var{number}
475 @itemx --starting-line-number=@var{number}
477 @opindex --starting-line-number
478 Set the initial line number on each logical page to @var{number} (default 1).
480 @item -w @var{number}
481 @itemx --number-width=@var{number}
483 @opindex --number-width
484 Use @var{number} characters for line numbers (default 6).
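A sketch combining several of these options, assuming a GNU @code{nl}.
Only short option letters are used; the sample text and variable names
are hypothetical.

```shell
# Number every line (-b a), zero-padded (-n rz), 3 characters wide (-w 3),
# with ': ' between the number and the text (-s).
padded=$(printf 'alpha\n\nbeta\n' | nl -b a -n rz -w 3 -s ': ')

# Start at 10 (-v) and step by 5 (-i), 2 characters wide.
stepped=$(printf 'alpha\n\nbeta\n' | nl -b a -v 10 -i 5 -w 2 -s ' ')

printf '%s\n' "$padded" "$stepped"
```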
490 @section @code{od}: Write files in octal or other formats
493 @cindex octal dump of files
494 @cindex hex dump of files
495 @cindex ASCII dump of files
496 @cindex file contents, dumping unambiguously
498 @code{od} writes an unambiguous representation of each @var{file}
499 (@samp{-} means standard input), or standard input if none are given.
503 od [@var{option}]@dots{} [@var{file}]@dots{}
504 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
507 Each line of output consists of the offset in the input, followed by
508 groups of data from the file. By default, @code{od} prints the offset in
509 octal, and each group of file data is two bytes of input printed as a
512 The program accepts the following options. Also see @ref{Common options}.
517 @itemx --address-radix=@var{radix}
519 @opindex --address-radix
520 @cindex radix for file offsets
521 @cindex file offset radix
522 Select the base in which file offsets are printed. @var{radix} can
523 be one of the following:
533 none (do not print offsets).
536 The default is octal.
539 @itemx --skip-bytes=@var{bytes}
541 @opindex --skip-bytes
542 Skip @var{bytes} input bytes before formatting and writing. If
543 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
544 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
545 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
546 by 1024, and @samp{m} by 1048576.
549 @itemx --read-bytes=@var{bytes}
551 @opindex --read-bytes
552 Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
@var{bytes} are interpreted as for the @samp{-j} option.
556 @itemx --strings[=@var{n}]
559 @cindex string constants, outputting
560 Instead of the normal output, output only @dfn{string constants}: at
561 least @var{n} (3 by default) consecutive ASCII graphic characters,
562 followed by a null (zero) byte.
565 @itemx --format=@var{type}
568 Select the format in which to output the file data. @var{type} is a
569 string of one or more of the below type indicator characters. If you
570 include more than one type indicator character in a single @var{type}
571 string, or use this option more than once, @code{od} writes one copy
572 of each output line using each of the data types that you specified,
573 in the order that you specified.
575 Adding a trailing ``z'' to any type specification appends a display
576 of the ASCII character representation of the printable characters
577 to the output line generated by the type specification.
583 ASCII character or backslash escape,
596 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
597 newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
@samp{ }, @samp{\n}, and @samp{\0}, respectively.
601 Except for types @samp{a} and @samp{c}, you can specify the number
602 of bytes to use in interpreting each number in the given data type
603 by following the type indicator character with a decimal integer.
Alternatively, you can specify the size of one of the C compiler's
605 built-in data types by following the type indicator character with
606 one of the following characters. For integers (@samp{d}, @samp{o},
620 For floating point (@code{f}):
632 @itemx --output-duplicates
634 @opindex --output-duplicates
635 Output consecutive lines that are identical. By default, when two or
636 more consecutive output lines would be identical, @code{od} outputs only
637 the first line, and puts just an asterisk on the following line to
638 indicate the elision.
641 @itemx --width[=@var{n}]
Dump @var{n} input bytes per output line. This must be a multiple of
645 the least common multiple of the sizes associated with the specified
646 output types. If @var{n} is omitted, the default is 32. If this option
647 is not given at all, the default is 16.
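The options described so far can be exercised with a sketch like the
following, which assumes a GNU @code{od}. The helper @code{squash} and
the sample inputs are hypothetical; whitespace is stripped so that the
results do not depend on the exact column layout.

```shell
# Remove blanks and newlines from a dump.
squash() { tr -d ' \n'; }

# -A n omits offsets; -t x1 prints one-byte hexadecimal groups.
hex=$(printf 'AB\n' | od -A n -t x1 | squash)

# -j skips input bytes and -N limits how many bytes are dumped.
middle=$(printf 'ABCDEF' | od -A n -t x1 -j 2 -N 3 | squash)

# 48 identical bytes fill three 16-byte output lines; identical
# consecutive lines are elided with an asterisk unless -v is given.
elided=$(printf '%048d' 0 | od -A n -t x1 | wc -l)
full=$(printf '%048d' 0 | od -A n -t x1 -v | wc -l)

echo "$hex $middle $elided $full"
```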
651 The next several options map the old, pre-@sc{POSIX} format specification
652 options to the corresponding @sc{POSIX} format specs. GNU @code{od} accepts
653 any combination of old- and new-style options. Format specification
660 Output as named characters. Equivalent to @samp{-ta}.
664 Output as octal bytes. Equivalent to @samp{-toC}.
668 Output as ASCII characters or backslash escapes. Equivalent to
673 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
677 Output as floats. Equivalent to @samp{-tfF}.
681 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
685 Output as decimal shorts. Equivalent to @samp{-td2}.
689 Output as decimal longs. Equivalent to @samp{-td4}.
693 Output as octal shorts. Equivalent to @samp{-to2}.
697 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
701 @opindex --traditional
Recognize the pre-@sc{POSIX} non-option arguments that traditional @code{od}
703 accepted. The following syntax:
706 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
710 can be used to specify at most one file and optional arguments
711 specifying an offset and a pseudo-start address, @var{label}. By
712 default, @var{offset} is interpreted as an octal number specifying how
713 many input bytes to skip before formatting and writing. The optional
714 trailing decimal point forces the interpretation of @var{offset} as a
decimal number. If no decimal point is specified and the offset begins with
@samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number. If
717 there is a trailing @samp{b}, the number of bytes skipped will be
718 @var{offset} multiplied by 512. The @var{label} argument is interpreted
719 just like @var{offset}, but it specifies an initial pseudo-address. The
720 pseudo-addresses are displayed in parentheses following any normal
726 @node Formatting file contents
727 @chapter Formatting file contents
729 @cindex formatting file contents
731 These commands reformat the contents of files.
734 * fmt invocation:: Reformat paragraph text.
735 * pr invocation:: Paginate or columnate files for printing.
736 * fold invocation:: Wrap input lines to fit in specified width.
741 @section @code{fmt}: Reformat paragraph text
744 @cindex reformatting paragraph text
745 @cindex paragraphs, reformatting
746 @cindex text, reformatting
748 @code{fmt} fills and joins lines to produce output lines of (at most)
749 a given number of characters (75 by default). Synopsis:
752 fmt [@var{option}]@dots{} [@var{file}]@dots{}
755 @code{fmt} reads from the specified @var{file} arguments (or standard
756 input if none are given), and writes to standard output.
758 By default, blank lines, spaces between words, and indentation are
759 preserved in the output; successive input lines with different
760 indentation are not joined; tabs are expanded on input and introduced on
763 @cindex line-breaking
764 @cindex sentences and line-breaking
765 @cindex Knuth, Donald E.
766 @cindex Plass, Michael F.
767 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
768 avoid line breaks after the first word of a sentence or before the last
769 word of a sentence. A @dfn{sentence break} is defined as either the end
770 of a paragraph or a word ending in any of @samp{.?!}, followed by two
771 spaces or end of line, ignoring any intervening parentheses or quotes.
772 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
773 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
774 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
775 and Experience}, 11 (1981), 1119--1184).
777 The program accepts the following options. Also see @ref{Common options}.
782 @itemx --crown-margin
784 @opindex --crown-margin
786 @dfn{Crown margin} mode: preserve the indentation of the first two
787 lines within a paragraph, and align the left margin of each subsequent
788 line with that of the second line.
791 @itemx --tagged-paragraph
793 @opindex --tagged-paragraph
794 @cindex tagged paragraphs
795 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
796 indentation of the first line of a paragraph is the same as the
797 indentation of the second, the first line is treated as a one-line
803 @opindex --split-only
804 Split lines only. Do not join short lines to form longer ones. This
805 prevents sample lines of code, and other such ``formatted'' text from
806 being unduly combined.
809 @itemx --uniform-spacing
811 @opindex --uniform-spacing
812 Uniform spacing. Reduce spacing between words to one space, and spacing
813 between sentences to two spaces.
816 @itemx -w @var{width}
817 @itemx --width=@var{width}
818 @opindex -@var{width}
821 Fill output lines up to @var{width} characters (default 75). @code{fmt}
822 initially tries to make lines about 7% shorter than this, to give it
823 room to balance line lengths.
825 @item -p @var{prefix}
826 @itemx --prefix=@var{prefix}
827 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
828 are subject to formatting. The prefix and any preceding whitespace are
829 stripped for the formatting and then re-attached to each formatted output
830 line. One use is to format certain kinds of program comments, while
831 leaving the code unchanged.
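A sketch of the main modes, assuming a GNU @code{fmt}. The sample text
and variable names are hypothetical, and the exact line breaks chosen
for a given width are left to @code{fmt}, so the width check only looks
at the longest output line.

```shell
# Short lines are joined and refilled.
joined=$(printf 'one\ntwo\nthree\n' | fmt)

# -w caps the output width; record the longest resulting line.
widest=$(printf 'aaa bbb ccc ddd eee fff\n' | fmt -w 10 |
    awk '{ if (length($0) > m) m = length($0) } END { print m }')

# -s splits long lines but never joins short ones.
split_only=$(printf 'one\ntwo\n' | fmt -s)

# -p reformats only lines that begin with the prefix; other lines,
# such as the code line here, pass through unchanged.
comment=$(printf '# one two\n# three\ncode here\n' | fmt -p '# ')

printf '%s\n' "$joined" "$widest" "$split_only" "$comment"
```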
837 @section @code{pr}: Paginate or columnate files for printing
840 @cindex printing, preparing files for
841 @cindex multicolumn output, generating
842 @cindex merging files in parallel
844 @code{pr} writes each @var{file} (@samp{-} means standard input), or
845 standard input if none are given, to standard output, paginating and
846 optionally outputting in multicolumn format; optionally merges all
847 @var{file}s, printing all in parallel, one per column. Synopsis:
850 pr [@var{option}]@dots{} [@var{file}]@dots{}
853 By default, a 5-line header is printed: two blank lines; a line with the
854 date, the file name, and the page count; and two more blank lines. A
855 footer of five blank lines is also printed. With the @samp{-f} option, a
3-line header is printed: the leading two blank lines are omitted, and no
footer is used. The default @var{page_length} in both cases is 66 lines.
858 The text line of the header takes up the full @var{page_width} in the
859 form @samp{yy-mm-dd HH:MM string Page nnnn}. String is a centered
862 Form feeds in the input cause page breaks in the output. Multiple form
863 feeds produce empty pages.
865 Columns have equal width, separated by an optional string (default
866 space). Lines will always be truncated to line width (default 72),
unless you use the @samp{-j} option. For single-column output, no line
truncation occurs by default. Use the @samp{-w} option to truncate lines
871 The program accepts the following options. Also see @ref{Common options}.
875 @item +@var{first_page}[@var{:last_page}]
876 @opindex +@var{first_page}[@var{:last_page}]
Begin printing with page @var{first_page} and stop with
@var{last_page}. A missing @samp{:@var{last_page}} implies end of file. When
the number of skipped pages is computed, each form feed in the input file
results in a new page. Page counting with and without
@samp{+@var{first_page}} is identical. By default, counting starts with the
first page of the input file (not the first page printed). Page numbering may be
altered by the @samp{-N} option.
886 @opindex -@var{column}
888 With each single @var{file}, produce @var{column}-column output and
889 print columns down. The column width is automatically estimated from
890 @var{page_width}. This option might well cause some columns to be
891 truncated. The number of lines in the columns on each page will be
892 balanced. @samp{-@var{column}} may not be used with @samp{-m} option.
896 @cindex across columns
897 With each single @var{file}, print columns across rather than down.
898 @var{column} must be greater than one.
902 Print control characters using hat notation (e.g., @samp{^G}); print
903 other unprintable characters in octal backslash notation. By default,
904 unprintable characters are not changed.
908 @cindex double spacing
909 Double space the output.
911 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
914 Expand tabs to spaces on input. Optional argument @var{in-tabchar} is
915 the input tab character (default is @key{TAB}). Second optional
916 argument @var{in-tabwidth} is the input tab character's width (default
Use a form feed instead of newlines to separate output pages. The default
page length of 66 lines is not altered, but the number of lines of text
per page changes from 56 to 63.
928 @item -h @var{HEADER}
930 Replace the file name in the header with the centered string
931 @var{header}. Left-hand-side truncation (marked by a @samp{*}) may occur
932 if the total header line @samp{yy-mm-dd HH:MM HEADER Page nnnn}
becomes larger than @var{page_width}. @samp{-h ""} prints a blank header
line. Do not use @samp{-h""}: a space between the @samp{-h} option and its
argument is always required.
937 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
940 Replace spaces with tabs on output. Optional argument @var{out-tabchar}
941 is the output tab character (default is @key{TAB}). Second optional
942 argument @var{out-tabwidth} is the output tab character's width (default
947 Merge lines of full length. Used together with the column options
948 @samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}. Turns off
949 @samp{-w} line truncation; no column alignment used; may be used with
950 @samp{-s[@var{separator}]}.
953 @item -l @var{page_length}
955 Set the page length to @var{page_length} (default 66) lines. If
@var{page_length} is less than or equal to 10 (or less than or equal to 3 with @samp{-f}),
957 the headers and footers are omitted, and all form feeds set in input
958 files are eliminated, as if the @samp{-T} option had been given.
962 Merge and print all @var{file}s in parallel, one in each column. If a
963 line is too long to fit in a column, it is truncated (but see
964 @samp{-j}). @samp{-s[@var{separator}]} may be used. Empty pages in some
965 @var{file}s (form feeds set) produce empty columns, still marked by
966 @var{separator}. Completely empty common pages show no separators or
967 line numbers. The default header becomes
968 @samp{yy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
969 @samp{-h @var{header}} to fill up the middle part.
972 @item -n[@var{number-separator}[@var{digits}]]
Precede each column with a line number; with parallel @var{file}s
(@samp{-m}), precede each line (rather than each column) with a line number.
Optional argument @var{number-separator} is the character to print after each
number (default is @key{TAB}). Optional argument @var{digits} is the number of
digits per line number (default is 5). By default, line counting starts with
the first line of the input file (not with the first line printed, see
982 @item -N @var{line_number}
Start line counting with number @var{line_number} at the first line of the first
989 @cindex indenting lines
Indent each line with a margin @var{n} spaces wide (default is zero), i.e., set
992 the left margin. The total page width is @var{n} plus the width set
993 with the @samp{-w} option.
997 Do not print a warning message when an argument @var{file} cannot be
998 opened. (The exit status will still be nonzero, however.)
1000 @item -s[@var{separator}]
Separate columns by a string @var{separator}. Do not write
@samp{-s @var{separator}}: no space is allowed between the flag and its
argument. If this option is omitted altogether, the default is @key{TAB}
when the @samp{-j} option is given and a space otherwise (same as
@samp{-s" "}). With @samp{-s} alone, no separator is used (same as
@samp{-s""}). @samp{-s} does not affect line truncation or column alignment.
1011 Do not print the usual header [and footer] on each page, and do not fill
out the bottoms of pages (with blank lines or a form feed). No page
structure is produced, but form feeds set in the input files are retained. The
predefined page layout is not changed. @samp{-t} or @samp{-T} may be
useful together with other options; e.g., @samp{-t -e4} expands each
@key{TAB} in the input file to 4 spaces but makes no other changes.
1017 Use of @samp{-t} overrides @samp{-h}.
Do not print header [and footer]. In addition, eliminate all form feeds
1022 set in the input files.
1026 Print unprintable characters in octal backslash notation.
1028 @item -w @var{page_width}
1030 Set the page width to @var{page_width} (default 72) characters.
With or without @samp{-w}, header lines are always truncated to
1032 @var{page_width} characters. With @samp{-w}, text lines are truncated,
1033 unless @samp{-j} is used. Without @samp{-w} together with one of the
1034 column options @samp{-@var{column}}, @samp{-a -@var{column}} or
1035 @samp{-m}, default truncation of text lines to 72 characters is used.
1036 Without @samp{-w} and without any of the column options, no line
1037 truncation is used. That's equivalent to @samp{-w 72 -j}.
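The difference between printing columns down and across can be seen
with a sketch like the following, assuming a GNU @code{pr}. The helper
@code{lines} and the variable names are hypothetical; @code{awk} is used
only to reduce the padded columns to single spaces.

```shell
# Hypothetical four-line input.
lines() { printf '%s\n' 1 2 3 4; }

# -t drops the header and footer; -2 asks for two columns, filled down.
down=$(lines | pr -t -2 | awk '{ print $1, $2 }')

# Adding -a fills the two columns across instead.
across=$(lines | pr -t -a -2 | awk '{ print $1, $2 }')

printf '%s\n' "down:" "$down" "across:" "$across"
```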
1042 @node fold invocation
1043 @section @code{fold}: Wrap input lines to fit in specified width
1046 @cindex wrapping long input lines
1047 @cindex folding long input lines
1049 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1050 standard input if none are given, to standard output, breaking long
1054 fold [@var{option}]@dots{} [@var{file}]@dots{}
1057 By default, @code{fold} breaks lines wider than 80 columns. The output
1058 is split into as many lines as necessary.
1060 @cindex screen columns
1061 @code{fold} counts screen columns by default; thus, a tab may count more
1062 than one column, backspace decreases the column count, and carriage
1063 return sets the column to zero.
1065 The program accepts the following options. Also see @ref{Common options}.
1073 Count bytes rather than columns, so that tabs, backspaces, and carriage
1074 returns are each counted as taking up one column, just like other
1081 Break at word boundaries: the line is broken after the last blank before
1082 the maximum line length. If the line contains no such blanks, the line
1083 is broken at the maximum line length as usual.
1085 @item -w @var{width}
1086 @itemx --width=@var{width}
1089 Use a maximum line length of @var{width} columns instead of 80.
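A brief sketch of the width and word-boundary options, assuming a GNU
@code{fold}. The sample strings and variable names are hypothetical.

```shell
# -w 4 breaks the line after every 4 columns.
chunks=$(printf 'abcdefghij\n' | fold -w 4)

# -s breaks after the last blank that fits within the width, so no
# word is cut in two when a blank is available.
words=$(printf 'the quick brown fox\n' | fold -s -w 10)

# Longest line produced by the -s run.
longest=$(printf '%s\n' "$words" |
    awk '{ if (length($0) > m) m = length($0) } END { print m }')

printf '%s\n' "$chunks" "$words"
```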
1094 @node Output of parts of files
1095 @chapter Output of parts of files
1097 @cindex output of parts of files
1098 @cindex parts of files, output of
1100 These commands output pieces of the input.
1103 * head invocation:: Output the first part of files.
1104 * tail invocation:: Output the last part of files.
1105 * split invocation:: Split a file into fixed-size pieces.
1106 * csplit invocation:: Split a file into context-determined pieces.
1109 @node head invocation
1110 @section @code{head}: Output the first part of files
1113 @cindex initial part of files, outputting
1114 @cindex first part of files, outputting
1116 @code{head} prints the first part (10 lines by default) of each
1117 @var{file}; it reads from standard input if no files are given or
1118 when given a @var{file} of @samp{-}. Synopses:
1121 head [@var{option}]@dots{} [@var{file}]@dots{}
1122 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1125 If more than one @var{file} is specified, @code{head} prints a
1126 one-line header consisting of
1128 ==> @var{file name} <==
1131 before the output for each @var{file}.
1133 @code{head} accepts two option formats: the new one, in which numbers
1134 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1135 the number precedes any option letters (@samp{-1q}).
1137 The program accepts the following options. Also see @ref{Common options}.
1141 @item -@var{count}@var{options}
1142 @opindex -@var{count}
1143 This option is only recognized if it is specified first. @var{count} is
1144 a decimal number optionally followed by a size letter (@samp{b},
1145 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1146 or other option letters (@samp{cqv}).
1148 @item -c @var{bytes}
1149 @itemx --bytes=@var{bytes}
1152 Print the first @var{bytes} bytes, instead of initial lines. Appending
1153 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1157 @itemx --lines=@var{n}
1160 Output the first @var{n} lines.
1168 Never print file name headers.
1174 Always print file name headers.
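The behavior described in this section can be sketched as follows. The
scratch directory, the file names @file{a} and @file{b}, and the
variable names are assumptions made for illustration, and
@code{mktemp} is assumed to be available:

```shell
# Scratch directory with two small, hypothetical files.
tmp=$(mktemp -d)
printf '%s\n' a1 a2 a3 > "$tmp/a"
printf '%s\n' b1 b2 b3 > "$tmp/b"

# First two lines of one file.
first_two=$(head -n 2 "$tmp/a")

# First four bytes instead of initial lines.
first_bytes=$(head -c 4 "$tmp/a")

# Two files: a "==> name <==" header precedes each file's output.
with_headers=$(cd "$tmp" && head -n 1 a b)

# -q suppresses the file name headers.
no_headers=$(cd "$tmp" && head -q -n 1 a b)

printf '%s\n' "$with_headers"
rm -rf "$tmp"
```

The sketch uses only the new option format (@samp{-n 2}), which is the
safer choice in scripts.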
1179 @node tail invocation
1180 @section @code{tail}: Output the last part of files
1183 @cindex last part of files, outputting
1185 @code{tail} prints the last part (10 lines by default) of each
1186 @var{file}; it reads from standard input if no files are given or
1187 when given a @var{file} of @samp{-}. Synopses:
1190 tail [@var{option}]@dots{} [@var{file}]@dots{}
1191 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1192 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1195 If more than one @var{file} is specified, @code{tail} prints a
1196 one-line header consisting of
1198 ==> @var{file name} <==
1201 before the output for each @var{file}.
1203 @cindex BSD @code{tail}
1204 GNU @code{tail} can output any amount of data (some other versions of
1205 @code{tail} cannot). It also has no @samp{-r} option (print in
1206 reverse), since reversing a file is really a different job from printing
1207 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1208 only reverse files that are at most as large as its buffer, which is
1209 typically 32k. A more reliable and versatile way to reverse files is
1210 the GNU @code{tac} command.
1212 @code{tail} accepts two option formats: the new one, in which numbers
1213 are arguments to the options (@samp{-n 1}), and the old one, in which
1214 the number precedes any option letters (@samp{-1} or @samp{+1}).
1216 If any option-argument is a number @var{n} starting with a @samp{+},
1217 @code{tail} begins printing with the @var{n}th item from the start of
1218 each file, instead of from the end.
1220 The program accepts the following options. Also see @ref{Common options}.
1226 @opindex -@var{count}
1227 @opindex +@var{count}
1228 This option is only recognized if it is specified first. @var{count} is
1229 a decimal number optionally followed by a size letter (@samp{b},
1230 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1231 or other option letters (@samp{cfqv}).
1233 @item -c @var{bytes}
1234 @itemx --bytes=@var{bytes}
1237 Output the last @var{bytes} bytes, instead of final lines. Appending
1238 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1245 @cindex growing files
1246 Loop forever trying to read more characters at the end of the file,
1247 presumably because the file is growing. Ignored if reading from a pipe.
1248 If more than one file is given, @code{tail} prints a header whenever it
1249 gets output from a different file, to indicate which file that output is
1253 @itemx --lines=@var{n}
1256 Output the last @var{n} lines.
1264 Never print file name headers.
1270 Always print file name headers.
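A short sketch of counting from the end versus counting from the start,
using the new option format. The five-line input and the variable names
are illustrative assumptions:

```shell
# Five one-letter lines as sample input.
letters=$(printf '%s\n' a b c d e)

# Last two lines, counted from the end.
last_two=$(printf '%s\n' "$letters" | tail -n 2)

# A leading '+' counts from the start: begin printing at line 3.
from_third=$(printf '%s\n' "$letters" | tail -n +3)

# Last three bytes instead of final lines.
last_bytes=$(printf 'abcdef' | tail -c 3)

printf '%s\n' "$last_two" "$from_third" "$last_bytes"
```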
1275 @node split invocation
1276 @section @code{split}: Split a file into fixed-size pieces
1279 @cindex splitting a file into pieces
1280 @cindex pieces, splitting a file into
1282 @code{split} creates output files containing consecutive sections of
1283 @var{input} (standard input if none is given or @var{input} is
1284 @samp{-}). Synopsis:
1287 split [@var{option}] [@var{input} [@var{prefix}]]
1290 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1291 left over for the last section), into each output file.
1293 @cindex output file name prefix
1294 The output files' names consist of @var{prefix} (@samp{x} by default)
1295 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1296 that concatenating the output files in sorted order by file name produces
1297 the original input file. (If more than 676 output files are required,
1298 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
1300 The program accepts the following options. Also see @ref{Common options}.
1305 @itemx -l @var{lines}
1306 @itemx --lines=@var{lines}
1309 Put @var{lines} lines of @var{input} into each output file.
1311 @item -b @var{bytes}
1312 @itemx --bytes=@var{bytes}
1315 Put the first @var{bytes} bytes of @var{input} into each output file.
1316 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1317 @samp{m} by 1048576.
1319 @item -C @var{bytes}
1320 @itemx --line-bytes=@var{bytes}
1322 @opindex --line-bytes
1323 Put into each output file as many complete lines of @var{input} as
1324 possible without exceeding @var{bytes} bytes. For lines longer than
1325 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1326 fewer than @var{bytes} bytes of the line are left, then continue
1327 normally. @var{bytes} has the same format as for the @samp{--bytes}
1332 Write a diagnostic to standard error just before each output file is opened.
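The naming scheme and the round-trip property described above can be
checked with a sketch like the following. The scratch directory, the
input file name @file{input}, and the prefix @samp{part.} are
assumptions chosen for illustration:

```shell
# Seven-line sample input in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 > "$tmp/input"

# Three lines per piece, output names starting with "part.".
(cd "$tmp" && split -l 3 input part.)

# The pieces are named part.aa, part.ab, part.ac.
pieces=$(cd "$tmp" && ls part.*)

# The last piece holds whatever is left over (one line here).
leftover=$(wc -l < "$tmp/part.ac")

# Concatenating the pieces in sorted order reproduces the input.
rejoined=$(cd "$tmp" && cat part.* | cmp - input && echo same)

printf '%s\n' "$pieces"
rm -rf "$tmp"
```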
1337 @node csplit invocation
1338 @section @code{csplit}: Split a file into context-determined pieces
1341 @cindex context splitting
1342 @cindex splitting a file into pieces by context
1344 @code{csplit} creates zero or more output files containing sections of
1345 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1348 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1351 The contents of the output files are determined by the @var{pattern}
1352 arguments, as detailed below. An error occurs if a @var{pattern}
1353 argument refers to a nonexistent line of the input file (e.g., if no
1354 remaining line matches a given regular expression). After every
1355 @var{pattern} has been matched, any remaining input is copied into one
1358 By default, @code{csplit} prints the number of bytes written to each
1359 output file after it has been created.
1361 The types of pattern arguments are:
1366 Create an output file containing the input up to but not including line
1367 @var{n} (a positive integer). If followed by a repeat count, also
1368 create an output file containing the next @var{n} lines of the input
1369 file once for each repeat.
1371 @item /@var{regexp}/[@var{offset}]
1372 Create an output file containing the current line up to (but not
1373 including) the next line of the input file that contains a match for
1374 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1375 followed by a positive integer. If it is given, the input up to the
1376 matching line plus or minus @var{offset} is put into the output file,
1377 and the line after that begins the next section of input.
1379 @item %@var{regexp}%[@var{offset}]
1380 Like the previous type, except that it does not create an output
1381 file, so that section of the input file is effectively ignored.
1383 @item @{@var{repeat-count}@}
1384 Repeat the previous pattern @var{repeat-count} additional
1385 times. @var{repeat-count} can either be a positive integer or an
1386 asterisk, meaning repeat as many times as necessary until the input is
1391 The output files' names consist of a prefix (@samp{xx} by default)
1392 followed by a suffix. By default, the suffix is an ascending sequence
1393 of two-digit decimal numbers from @samp{00} and up to @samp{99}. In any
1394 case, concatenating the output files in sorted order by filename
1395 produces the original input file.
1397 By default, if @code{csplit} encounters an error or receives a hangup,
1398 interrupt, quit, or terminate signal, it removes any output files
1399 that it has created so far before it exits.
1401 The program accepts the following options. Also see @ref{Common options}.
1405 @item -f @var{prefix}
1406 @itemx --prefix=@var{prefix}
1409 @cindex output file name prefix
1410 Use @var{prefix} as the output file name prefix.
1412 @item -b @var{suffix}
1413 @itemx --suffix=@var{suffix}
1416 @cindex output file name suffix
1417 Use @var{suffix} as the output file name suffix. When this option is
1418 specified, the suffix string must include exactly one
1419 @code{printf(3)}-style conversion specification, possibly including
1420 format specification flags, a field width, a precision specification,
1421 or all of these kinds of modifiers. The format letter must convert a
1422 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1423 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1424 entire @var{suffix} is given (with the current output file number) to
1425 @code{sprintf(3)} to form the file name suffixes for each of the
1426 individual output files in turn. If this option is used, the
1427 @samp{--digits} option is ignored.
1429 @item -n @var{digits}
1430 @itemx --digits=@var{digits}
1433 Use output file names containing numbers that are @var{digits} digits
1434 long instead of the default 2.
1439 @opindex --keep-files
1440 Do not remove output files when errors are encountered.
1443 @itemx --elide-empty-files
1445 @opindex --elide-empty-files
1446 Suppress the generation of zero-length output files. (In cases where
1447 the section delimiters of the input file are supposed to mark the first
1448 lines of each of the sections, the first output file will generally be a
1449 zero-length file unless you use this option.) The output file sequence
1450 numbers always run consecutively starting from 0, even when this option
1461 Do not print counts of output file sizes.
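As a hedged sketch of a common use, the following splits a file into
one piece per section, where each section starts with a line beginning
with @samp{#}. The input contents, the prefix @samp{sec}, and the
scratch directory are illustrative assumptions. It relies on the short
options @samp{-s} (do not print sizes), @samp{-z} (elide empty files),
and @samp{-f} (prefix) of current GNU @code{csplit}:

```shell
# Sample input: three sections, each introduced by a "#" line.
tmp=$(mktemp -d)
printf '%s\n' '# one' a '# two' b '# three' c > "$tmp/input"

# Split before every line matching /^#/, repeating until the input is
# exhausted.  Because the first line already matches, the first piece
# would be empty; -z suppresses it.
(cd "$tmp" && csplit -s -z -f sec input '/^#/' '{*}')

pieces=$(cd "$tmp" && ls sec*)
second_header=$(head -n 1 "$tmp/sec01")

# Concatenating the pieces in sorted order reproduces the input.
rejoined=$(cd "$tmp" && cat sec* | cmp - input && echo same)

printf '%s\n' "$pieces"
rm -rf "$tmp"
```

Without @samp{-z}, the same command would also leave a zero-length
first file, as described under @samp{--elide-empty-files}.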
1466 @node Summarizing files
1467 @chapter Summarizing files
1469 @cindex summarizing files
1471 These commands generate just a few numbers representing entire
1475 * wc invocation:: Print byte, word, and line counts.
1476 * sum invocation:: Print checksum and block counts.
1477 * cksum invocation:: Print CRC checksum and byte counts.
1478 * md5sum invocation:: Print or check message-digests.
1483 @section @code{wc}: Print byte, word, and line counts
1490 @code{wc} counts the number of bytes, whitespace-separated words, and
1491 newlines in each given @var{file}, or standard input if none are given
1492 or for a @var{file} of @samp{-}. Synopsis:
1495 wc [@var{option}]@dots{} [@var{file}]@dots{}
1498 @cindex total counts
1499 @code{wc} prints one line of counts for each file, and if the file was
1500 given as an argument, it prints the file name following the counts. If
1501 more than one @var{file} is given, @code{wc} prints a final line
1502 containing the cumulative counts, with the file name @file{total}. The
1503 counts are printed in this order: newlines, words, bytes.
1505 By default, @code{wc} prints all three counts. Options can specify
1506 that only certain counts be printed. Options do not undo others
1507 previously given, so
1514 prints both the byte counts and the word counts.
1516 With the @code{--max-line-length} option, @code{wc} prints the length
1517 of the longest line per file, and if there is more than one file it
1518 prints the maximum (not the sum) of those lengths.
1520 The program accepts the following options. Also see @ref{Common options}.
1530 Print only the byte counts.
1536 Print only the word counts.
1542 Print only the newline counts.
1545 @itemx --max-line-length
1547 @opindex --max-line-length
1548 Print only the maximum line lengths.
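The output order and the way options combine can be seen in a small
sketch. The two-line input and the variable names are illustrative
assumptions, and the short form @samp{-L} of
@samp{--max-line-length} is assumed:

```shell
# All three counts, in the order newlines, words, bytes.
counts=$(printf 'one two\nthree\n' | wc)

# Options accumulate: this prints both the word and the byte counts.
both=$(printf 'one two\nthree\n' | wc -c -w)

# Length of the longest line ("one two" is 7 characters).
longest=$(printf 'one two\nthree\n' | wc -L)

# Unquoted expansion collapses the padding wc puts between counts.
echo $counts
echo $both
echo $longest
```

Note that the counts are printed in the fixed order given above,
whatever the order of the options on the command line.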
1553 @node sum invocation
1554 @section @code{sum}: Print checksum and block counts
1557 @cindex 16-bit checksum
1558 @cindex checksum, 16-bit
1560 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1561 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1564 sum [@var{option}]@dots{} [@var{file}]@dots{}
1567 @code{sum} prints the checksum for each @var{file} followed by the
1568 number of blocks in the file (rounded up). If more than one @var{file}
1569 is given, file names are also printed (by default). (With the
1570 @samp{--sysv} option, corresponding file names are printed when there is
1571 at least one file argument.)
1573 By default, GNU @code{sum} computes checksums using an algorithm
1574 compatible with BSD @code{sum} and prints file sizes in units of
1577 The program accepts the following options. Also see @ref{Common options}.
1583 @cindex BSD @code{sum}
1584 Use the default (BSD compatible) algorithm. This option is included for
1585 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1586 given, it has no effect.
1592 @cindex System V @code{sum}
1593 Compute checksums using an algorithm compatible with System V
1594 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1598 @code{sum} is provided for compatibility; the @code{cksum} program (see
1599 next section) is preferable in new applications.
1602 @node cksum invocation
1603 @section @code{cksum}: Print CRC checksum and byte counts
1606 @cindex cyclic redundancy check
1607 @cindex CRC checksum
1609 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1610 given @var{file}, or standard input if none are given or for a
1611 @var{file} of @samp{-}. Synopsis:
1614 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1617 @code{cksum} prints the CRC checksum for each file along with the number
1618 of bytes in the file, and the filename unless no arguments were given.
1620 @code{cksum} is typically used to ensure that files
1621 transferred by unreliable means (e.g., netnews) have not been corrupted,
1622 by comparing the @code{cksum} output for the received files with the
1623 @code{cksum} output for the original files (typically given in the
1626 The CRC algorithm is specified by the @sc{POSIX.2} standard. It is not
1627 compatible with the BSD or System V @code{sum} algorithms (see the
1628 previous section); it is more robust.
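The comparison described above might be scripted along these lines.
This is only a sketch: the file names @file{original} and
@file{received} are hypothetical, and the copy stands in for a real
transfer:

```shell
# A sample file and a copy standing in for the transferred file.
tmp=$(mktemp -d)
printf 'hello, world\n' > "$tmp/original"
cp "$tmp/original" "$tmp/received"

# Reading standard input keeps the file name out of the output, so the
# two results (checksum and byte count) can be compared directly.
sent=$(cksum < "$tmp/original")
got=$(cksum < "$tmp/received")

if [ "$sent" = "$got" ]; then
  verdict=intact
else
  verdict=corrupted
fi

echo "$verdict"
rm -rf "$tmp"
```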
1630 The only options are @samp{--help} and @samp{--version}. @xref{Common
1634 @node md5sum invocation
1635 @section @code{md5sum}: Print or check message-digests
1638 @cindex 128-bit checksum
1639 @cindex checksum, 128-bit
1640 @cindex fingerprint, 128-bit
1641 @cindex message-digest, 128-bit
1643 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1644 @dfn{message-digest}) for each specified @var{file}.
1645 If a @var{file} is specified as @samp{-} or if no files are given
1646 @code{md5sum} computes the checksum for the standard input.
1647 @code{md5sum} can also determine whether a file and checksum are
1648 consistent. Synopses:
1651 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1652 md5sum [@var{option}]@dots{} --check [@var{file}]
1655 For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
1656 indicating a binary or text input file, and the filename.
1657 If @var{file} is omitted or specified as @samp{-}, standard input is read.
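A minimal sketch of the two modes, generating a checksum file and then
verifying against it with the @samp{--check} and @samp{--status}
options described below. The file names @file{data} and
@file{data.md5} and the scratch directory are illustrative assumptions:

```shell
# A sample file in a scratch directory.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/data"

# Generating mode: record checksum, flag, and file name.
(cd "$tmp" && md5sum data > data.md5)
digest=$(cut -d ' ' -f 1 "$tmp/data.md5")

# Checking mode: --status reports only through the exit status.
if (cd "$tmp" && md5sum --check --status data.md5); then
  before=ok
else
  before=failed
fi

# After the file changes, the recorded checksum no longer matches.
printf 'tampered\n' >> "$tmp/data"
if (cd "$tmp" && md5sum --check --status data.md5); then
  after=ok
else
  after=failed
fi

echo "$before $after"
rm -rf "$tmp"
```

The 128-bit digest is printed as 32 hexadecimal digits at the start of
each line.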
1659 The program accepts the following options. Also see @ref{Common options}.
1667 @cindex binary input files
1668 Treat all input files as binary. This option has no effect on Unix
1669 systems, since they don't distinguish between binary and text files.
1670 This option is useful on systems that have different internal and
1671 external character representations.
1675 Read filenames and checksum information from the single @var{file}
1676 (or from stdin if no @var{file} was specified) and report whether
1677 each named file and the corresponding checksum data are consistent.
1678 The input to this mode of @code{md5sum} is usually the output of
1679 a prior, checksum-generating run of @code{md5sum}.
1680 Each valid line of input consists of an MD5 checksum, a binary/text
1681 flag, and then a filename.
1682 Binary files are marked with @samp{*}, text with @samp{ }.
1683 For each such line, @code{md5sum} reads the named file and computes its
1684 MD5 checksum. Then, if the computed message digest does not match the
1685 one on the line with the filename, the file is noted as having
1686 failed the test. Otherwise, the file passes the test.
1687 By default, for each valid line, one line is written to standard
1688 output indicating whether the named file passed the test.
1689 After all checks have been performed, if there were any failures,
1690 a warning is issued to standard error.
1691 Use the @samp{--status} option to inhibit that output.
1692 If any listed file cannot be opened or read, if any valid line has
1693 an MD5 checksum inconsistent with the associated file, or if no valid
1694 line is found, @code{md5sum} exits with nonzero status. Otherwise,
1695 it exits successfully.
1699 @cindex verifying MD5 checksums
1700 This option is useful only when verifying checksums.
1701 When verifying checksums, don't generate the default one-line-per-file
1702 diagnostic and don't output the warning summarizing any failures.
1703 Failures to open or read a file still evoke individual diagnostics to
1705 If all listed files are readable and are consistent with the associated
1706 MD5 checksums, exit successfully. Otherwise exit with a status code
1707 indicating there was a failure.
1713 @cindex text input files
1714 Treat all input files as text files. This is the reverse of
1721 @cindex verifying MD5 checksums
1722 When verifying checksums, warn about improperly formatted MD5 checksum lines.
1723 This option is useful only if all but a few lines in the checked input
1729 @node Operating on sorted files
1730 @chapter Operating on sorted files
1732 @cindex operating on sorted files
1733 @cindex sorted files, operations on
1735 These commands work with (or produce) sorted files.
1738 * sort invocation:: Sort text files.
1739 * uniq invocation:: Uniqify files.
1740 * comm invocation:: Compare two sorted files line by line.
1744 @node sort invocation
1745 @section @code{sort}: Sort text files
1748 @cindex sorting files
1750 @code{sort} sorts, merges, or compares all the lines from the given
1751 files, or standard input if none are given or for a @var{file} of
1752 @samp{-}. By default, @code{sort} writes the results to standard
1756 sort [@var{option}]@dots{} [@var{file}]@dots{}
1759 @code{sort} has three modes of operation: sort (the default), merge,
1760 and check for sortedness. The following options change the operation
1767 @cindex checking for sortedness
1768 Check whether the given files are already sorted: if they are not all
1769 sorted, print an error message and exit with a status of 1.
1770 Otherwise, exit successfully.
1774 @cindex merging sorted files
1775 Merge the given files by sorting them as a group. Each input file must
1776 already be individually sorted. It always works to sort instead of
1777 merge; merging is provided because it is faster, in the case where it
1782 A pair of lines is compared as follows: if any key fields have been
1783 specified, @code{sort} compares each pair of fields, in the order
1784 specified on the command line, according to the associated ordering
1785 options, until a difference is found or no fields are left.
1787 If any of the global options @samp{Mbdfinr} are given but no key fields
1788 are specified, @code{sort} compares the entire lines according to the
1791 Finally, as a last resort when all keys compare equal (or if no
1792 ordering options were specified at all), @code{sort} compares the lines
1793 byte by byte in machine collating sequence. The last resort comparison
1794 honors the @samp{-r} global option. The @samp{-s} (stable) option
1795 disables this last-resort comparison so that lines in which all fields
1796 compare equal are left in their original relative order. If no fields
1797 or global options are specified, @samp{-s} has no effect.
1799 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1800 input line length or restrictions on bytes allowed within lines. In
1801 addition, if the final byte of an input file is not a newline, GNU
1802 @code{sort} silently supplies one.
1804 Upon any error, @code{sort} exits with a status of @samp{2}.
1807 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1808 value as the directory for temporary files instead of @file{/tmp}. The
1809 @samp{-T @var{tempdir}} option in turn overrides the environment
1812 The following options affect the ordering of output lines. They may be
1813 specified globally or as part of a specific key field. If no key
1814 fields are specified, global options apply to comparison of entire
1815 lines; otherwise the global options are inherited by key fields that do
1816 not specify any special options of their own.
1822 @cindex blanks, ignoring leading
1823 Ignore leading blanks when finding sort keys in each line.
1827 @cindex phone directory order
1828 @cindex telephone directory order
1829 Sort in @dfn{phone directory} order: ignore all characters except
1830 letters, digits and blanks when sorting.
1834 @cindex case folding
1835 Fold lowercase characters into the equivalent uppercase characters when
1836 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1840 @cindex general numeric sort
1841 Sort numerically, but use @code{strtod(3)} to arrive at the numeric values.
1842 This allows floating point numbers to be specified in scientific notation,
1843 like @code{1.0e-34} and @code{10e100}. Use this option only if there
1844 is no alternative; it is much slower than @samp{-n} and numbers with
1845 too many significant digits will be compared as if they had been
1846 truncated. In addition, numbers outside the range of representable
1847 double precision floating point numbers are treated as if they were
1848 zeroes; overflow and underflow are not reported.
1852 @cindex unprintable characters, ignoring
1853 Ignore characters outside the printable ASCII range 040-0176 octal
1854 (inclusive) when sorting.
1858 @cindex months, sorting by
1859 An initial string, consisting of any amount of whitespace, followed
1860 by three letters abbreviating a month name, is folded to UPPER case and
1861 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1862 Invalid names compare low to valid names.
1866 @cindex numeric sort
1867 Sort numerically: the number begins each line; specifically, it consists
1868 of optional whitespace, an optional @samp{-} sign, and zero or more
1869 digits, optionally followed by a decimal point and zero or more digits.
1871 @code{sort -n} uses what might be considered an unconventional method
1872 to compare strings representing floating point numbers. Rather than
1873 first converting each string to the C @code{double} type and then
1874 comparing those values, @code{sort} aligns the decimal points in the two
1875 strings and compares the strings a character at a time. One benefit
1876 of using this approach is its speed. In practice this is much more
1877 efficient than performing the two corresponding string-to-double (or even
1878 string-to-integer) conversions and then comparing doubles. In addition,
1879 there is no corresponding loss of precision. Converting each string to
1880 @code{double} before comparison would limit precision to about 16 digits
1883 Neither a leading @samp{+} nor exponential notation is recognized.
1884 To compare such strings numerically, use the @samp{-g} option.
1888 @cindex reverse sorting
1889 Reverse the result of comparison, so that lines with greater key values
1890 appear earlier in the output instead of later.
1898 @item -o @var{output-file}
1900 @cindex overwriting of input, allowed
1901 Write output to @var{output-file} instead of standard output.
1902 If @var{output-file} is one of the input files, @code{sort} copies
1903 it to a temporary file before sorting and writing the output to
1906 @item -t @var{separator}
1908 @cindex field separator character
1909 Use character @var{separator} as the field separator when finding the
1910 sort keys in each line. By default, fields are separated by the empty
1911 string between a non-whitespace character and a whitespace character.
1912 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
1913 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
1914 not considered to be part of either the field preceding or the field
1919 @cindex uniqifying output
1920 For the default case or the @samp{-m} option, only output the first
1921 of a sequence of lines that compare equal. For the @samp{-c} option,
1922 check that no pair of consecutive lines compares equal.
1924 @item -k @var{pos1}[,@var{pos2}]
1927 The recommended, @sc{POSIX}, option for specifying a sort field. The field
1928 consists of the line between @var{pos1} and @var{pos2} (or the end of
1929 the line, if @var{pos2} is omitted), inclusive. Fields and character
1930 positions are numbered starting with 1. See below.
1934 @cindex sort zero-terminated lines
1935 Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII}
1936 @sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
1937 This option can be useful in conjunction with @samp{perl -0} or
1938 @samp{find -print0} and @samp{xargs -0} which do the same in order to
1939 reliably handle arbitrary pathnames (even those which contain Line Feed
1942 @item +@var{pos1}[-@var{pos2}]
1943 The obsolete, traditional option for specifying a sort field. The field
1944 consists of the line between @var{pos1} and up to but @emph{not including}
1945 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
1946 and character positions are numbered starting with 0. See below.
1950 In addition, when GNU @code{sort} is invoked with exactly one argument,
1951 options @samp{--help} and @samp{--version} are recognized. @xref{Common
1954 Historical (BSD and System V) implementations of @code{sort} have
1955 differed in their interpretation of some options, particularly
1956 @samp{-b}, @samp{-f}, and @samp{-n}. GNU @code{sort} follows the @sc{POSIX}
1957 behavior, which is usually (but not always!) like the System V behavior.
1958 According to @sc{POSIX}, @samp{-n} no longer implies @samp{-b}. For
1959 consistency, @samp{-M} has been changed in the same way. This may
1960 affect the meaning of character positions in field specifications in
1961 obscure cases. The only fix is to add an explicit @samp{-b}.
1963 A position in a sort field specified with the @samp{-k} or @samp{+}
1964 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
1965 of the field to use and @var{c} is the number of the first character
1966 from the beginning of the field (for @samp{+@var{pos}}) or from the end
1967 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
1968 is omitted, it is taken to be the first character in the field. If the
1969 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
1970 specification is counted from the first nonblank character of the field
1971 (for @samp{+@var{pos}}) or from the first nonblank character following
1972 the previous field (for @samp{-@var{pos}}).
1974 A sort key option may also have any of the option letters @samp{Mbdfinr}
1975 appended to it, in which case the global ordering options are not used
1976 for that particular field. The @samp{-b} option may be independently
1977 attached to either or both of the @samp{+@var{pos}} and
1978 @samp{-@var{pos}} parts of a field specification, and if it is inherited
1979 from the global options it will be attached to both.
1980 Keys may span multiple fields.
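As a quick, hedged illustration of the @samp{@var{f}.@var{c}} position
syntax and per-key modifiers just described, the sketch below sorts
three made-up records. The data and variable names are assumptions, and
@code{LC_ALL=C} is set only to make the byte ordering predictable:

```shell
# Colon-separated sample records: name, number, tag.
data='carol:30:x
alice:4:y
bob:30:a'

# Primary key: field 2 compared numerically (the 'n' modifier applies
# to that key only).  Ties are resolved on field 1.
by_num=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 2,2n -k 1,1)

# Key made of a single character: the second character of field 1.
by_char=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 1.2,1.2)

printf '%s\n' "$by_num" "$by_char"
```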
1982 Here are some examples to illustrate various combinations of options.
1983 In them, the @sc{POSIX} @samp{-k} option is used to specify sort keys rather
1984 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
1989 Sort in descending (reverse) numeric order.
1995 Sort alphabetically, omitting the first and second fields.
1996 This uses a single key composed of the characters beginning
1997 at the start of field three and extending to the end of each line.
2004 Sort numerically on the second field and resolve ties by sorting
2005 alphabetically on the third and fourth characters of field five.
2006 Use @samp{:} as the field delimiter.
2009 sort -t : -k 2,2n -k 5.3,5.4
2012 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2},
2013 @code{sort} would have used all characters beginning in the second field
2014 and extending to the end of the line as the primary @emph{numeric}
2015 key. For the large majority of applications, treating keys spanning
2016 more than one field as numeric will not do what you expect.
2018 Also note that the @samp{n} modifier was applied to the field-end
2019 specifier for the first key. It would have been equivalent to
2020 specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
2021 @samp{b} apply to the associated @emph{field}, regardless of whether
2022 the modifier character is attached to the field-start and/or the
2023 field-end part of the key specifier.
2026 Sort the password file on the fifth field and ignore any
2027 leading white space. Sort lines with equal values in field five
2028 on the numeric user ID in field three.
2031 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2034 An alternative is to use the global numeric modifier @samp{-n}.
2037 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
2041 Generate a tags file in case insensitive sorted order.
2043 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
2046 The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
2047 that pathnames that contain Line Feed characters will not get broken up
2048 by the sort operation.
2050 Finally, to ignore both leading and trailing white space, you
2051 could have applied the @samp{b} modifier to the field-end specifier
2055 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
2058 or by using the global @samp{-b} modifier instead of @samp{-n}
2059 and an explicit @samp{n} with the second key specifier.
2062 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
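To round off the examples, the following sketch contrasts @samp{-n}
with @samp{-g} on input containing exponential notation, and shows the
check mode and the @samp{-u} option. The inputs and variable names are
illustrative assumptions, and @code{LC_ALL=C} is set only for
predictable ordering:

```shell
nums='1e3
20
3'

# -n does not recognize exponents, so "1e3" is compared as 1.
with_n=$(printf '%s\n' "$nums" | LC_ALL=C sort -n)

# -g converts with strtod(3), so "1e3" is compared as 1000.
with_g=$(printf '%s\n' "$nums" | LC_ALL=C sort -g)

# -c only checks: the exit status is nonzero for unsorted input.
if printf '%s\n' b a | LC_ALL=C sort -c 2>/dev/null; then
  sorted=yes
else
  sorted=no
fi

# -u outputs only the first of each run of equal lines.
unique=$(printf '%s\n' b a b | LC_ALL=C sort -u)

printf '%s\n' "$with_n" "$with_g" "$sorted" "$unique"
```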
2068 @node uniq invocation
2069 @section @code{uniq}: Uniqify files
2072 @cindex uniqify files
2074 @code{uniq} writes the unique lines in the given @var{input}, or
2075 standard input if nothing is given or for an @var{input} name of
2079 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2082 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2083 discards all but one of identical successive lines. Optionally, it can
2084 instead show only lines that appear exactly once, or lines that appear
2087 The input must be sorted. If your input is not sorted, perhaps you want
2088 to use @code{sort -u}.
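The default behavior and the main alternatives can be sketched as
follows, using the short options @samp{-c}, @samp{-d}, and @samp{-u}.
The already sorted sample input and the variable names are illustrative
assumptions:

```shell
# Sorted sample input with repeated lines.
input='apple
apple
banana
cherry
cherry
cherry'

# Default: keep one copy of each run of identical lines.
collapsed=$(printf '%s\n' "$input" | uniq)

# -c prefixes each line with the number of times it occurred.
counted=$(printf '%s\n' "$input" | uniq -c)

# -d prints only lines that are repeated; -u only lines that are not.
repeated=$(printf '%s\n' "$input" | uniq -d)
singles=$(printf '%s\n' "$input" | uniq -u)

printf '%s\n' "$counted"
```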
2090 If no @var{output} file is specified, @code{uniq} writes to standard
2093 The program accepts the following options. Also see @ref{Common options}.
2099 @itemx --skip-fields=@var{n}
2102 @opindex --skip-fields
2103 Skip @var{n} fields on each line before checking for uniqueness. Fields
2104 are sequences of non-space non-tab characters that are separated from
2105 each other by at least one space or tab.
2109 @itemx --skip-chars=@var{n}
2112 @opindex --skip-chars
2113 Skip @var{n} characters before checking for uniqueness. If you use both
2114 the field and character skipping options, fields are skipped over first.
2120 Print the number of times each line occurred along with the line.
2123 @itemx --ignore-case
2125 @opindex --ignore-case
2126 Ignore differences in case when comparing lines.
2132 @cindex duplicate lines, outputting
2133 Print only duplicate lines.
2139 @cindex unique lines, outputting
2140 Print only unique lines.
2143 @itemx --check-chars=@var{n}
2145 @opindex --check-chars
2146 Compare @var{n} characters on each line (after skipping any specified
2147 fields and characters). By default the entire rest of the lines are
2153 @node comm invocation
2154 @section @code{comm}: Compare two sorted files line by line
2157 @cindex line-by-line comparison
2158 @cindex comparing sorted files
2160 @code{comm} writes to standard output lines that are common, and lines
2161 that are unique, to two input files; a file name of @samp{-} means
2162 standard input. Synopsis:
2165 comm [@var{option}]@dots{} @var{file1} @var{file2}
2168 The input files must be sorted before @code{comm} can be used.
2170 @cindex differing lines
2171 @cindex common lines
2172 With no options, @code{comm} produces three-column output. Column one
2173 contains lines unique to @var{file1}, column two contains lines unique
2174 to @var{file2}, and column three contains lines common to both files.
2175 Columns are separated by @key{TAB}.
2176 @c FIXME: when there's an option to supply an alternative separator
2177 @c string, append `by default' to the above sentence.
2182 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2183 the corresponding columns. Also see @ref{Common options}.
2185 Unlike some other comparison utilities, @code{comm} has an exit
2186 status that does not depend on the result of the comparison.
2187 Upon normal completion @code{comm} produces an exit code of zero.
2188 If there is an error it exits with nonzero status.
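The column-suppression options are easiest to understand by example.
The sketch below uses two small sorted lists invented for illustration;
the temporary file names are arbitrary, and @code{mktemp} is assumed to
be available.

```shell
# Two small sorted lists, invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/file1"
printf '%s\n' banana cherry date  > "$tmp/file2"

# Suppressing two columns leaves just the third one.
only1=$(LC_ALL=C comm -23 "$tmp/file1" "$tmp/file2")   # lines unique to file1
only2=$(LC_ALL=C comm -13 "$tmp/file1" "$tmp/file2")   # lines unique to file2
both=$(LC_ALL=C comm -12 "$tmp/file1" "$tmp/file2")    # lines common to both

printf 'only in file1: %s\n' "$only1"
printf 'only in file2: %s\n' "$only2"
printf 'in both:\n%s\n' "$both"
rm -r "$tmp"
```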
2191 @node Operating on fields within a line
2192 @chapter Operating on fields within a line
2195 * cut invocation:: Print selected parts of lines.
2196 * paste invocation:: Merge lines of files.
2197 * join invocation:: Join lines on a common field.
2201 @node cut invocation
2202 @section @code{cut}: Print selected parts of lines
2205 @code{cut} writes to standard output selected parts of each line of each
2206 input file, or standard input if no files are given or for a file name of
2210 cut [@var{option}]@dots{} [@var{file}]@dots{}
2213 In the table which follows, the @var{byte-list}, @var{character-list},
2214 and @var{field-list} are one or more numbers or ranges (two numbers
2215 separated by a dash) separated by commas. Bytes, characters, and
2216 fields are numbered starting at 1. Incomplete ranges may be
2217 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
2218 @samp{@var{n}} through end of line or last field.
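The list syntax can be tried on a short sample line. This is only a
sketch; the input strings are made up for illustration.

```shell
line='abcdefghij'

first3=$(printf '%s\n' "$line" | cut -c-3)       # -3 means 1-3
rest=$(printf '%s\n' "$line" | cut -c8-)         # 8- means 8 through end of line
mixed=$(printf '%s\n' "$line" | cut -c1,4-6)     # a single position plus a range
fields=$(printf 'a:b:c:d\n' | cut -d: -f2-)      # field 2 through the last field

printf '%s\n' "$first3" "$rest" "$mixed" "$fields"
```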
2220 The program accepts the following options. Also see @ref{Common
2225 @item -b @var{byte-list}
2226 @itemx --bytes=@var{byte-list}
2229 Print only the bytes in positions listed in @var{byte-list}. Tabs and
2230 backspaces are treated like any other character; they take up 1 byte.
2232 @item -c @var{character-list}
2233 @itemx --characters=@var{character-list}
2235 @opindex --characters
2236 Print only characters in positions listed in @var{character-list}.
2237 The same as @samp{-b} for now, but internationalization will change
2238 that. Tabs and backspaces are treated like any other character; they
2239 take up 1 character.
2241 @item -f @var{field-list}
2242 @itemx --fields=@var{field-list}
2245 Print only the fields listed in @var{field-list}. Fields are
2246 separated by a @key{TAB} by default.
2248 @item -d @var{delim}
2249 @itemx --delimiter=@var{delim}
2251 @opindex --delimiter
2252 For @samp{-f}, fields are separated by the first character in @var{delim}
2253 (default is @key{TAB}).
2257 Do not split multi-byte characters (no-op for now).
2260 @itemx --only-delimited
2262 @opindex --only-delimited
2263 For @samp{-f}, do not print lines that do not contain the field separator
2269 @node paste invocation
2270 @section @code{paste}: Merge lines of files
2273 @cindex merging files
2275 @code{paste} writes to standard output lines consisting of sequentially
2276 corresponding lines of each given file, separated by @key{TAB}.
2277 Standard input is used for a file name of @samp{-} or if no input files
2283 paste [@var{option}]@dots{} [@var{file}]@dots{}
2286 The program accepts the following options. Also see @ref{Common options}.
2294 Paste the lines of one file at a time rather than one line from each
2297 @item -d @var{delim-list}
2298 @itemx --delimiters @var{delim-list}
2300 @opindex --delimiters
2301 Consecutively use the characters in @var{delim-list} instead of
2302 @key{TAB} to separate merged lines. When @var{delim-list} is
2303 exhausted, start again at its beginning.
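A short sketch of how the delimiter list is consumed. The file names
and contents are invented for illustration, and @code{mktemp} is assumed
to be available.

```shell
tmp=$(mktemp -d)
printf '%s\n' 1 2 > "$tmp/a"
printf '%s\n' x y > "$tmp/b"
printf '%s\n' p q > "$tmp/c"

# Three files need two separators per output line: ',' and then ';'.
merged=$(paste -d ',;' "$tmp/a" "$tmp/b" "$tmp/c")

# With -s the lines of one file are joined; the list restarts when exhausted.
serial=$(printf '%s\n' 1 2 3 4 | paste -s -d ',;' -)

printf '%s\n' "$merged" "$serial"
rm -r "$tmp"
```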
2308 @node join invocation
2309 @section @code{join}: Join lines on a common field
2312 @cindex common field, joining on
2314 @code{join} writes to standard output a line for each pair of input
2315 lines that have identical join fields. Synopsis:
2318 join [@var{option}]@dots{} @var{file1} @var{file2}
2321 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2322 meaning standard input. @var{file1} and @var{file2} should be already
2323 sorted in increasing order (not numerically) on the join fields; unless
2324 the @samp{-t} option is given, they should be sorted ignoring blanks at
2325 the start of the join field, as in @code{sort -b}. If the
2326 @samp{--ignore-case} option is given, lines should be sorted without
2327 regard to the case of characters in the join field, as in @code{sort -f}.
2329 The defaults are: the join field is the first field in each line;
2330 fields in the input are separated by one or more blanks, with leading
2331 blanks on the line ignored; fields in the output are separated by a
2332 space; each output line consists of the join field, the remaining
2333 fields from @var{file1}, then the remaining fields from @var{file2}.
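The defaults can be seen in a minimal sketch. Both input files are made
up for illustration and are already sorted on their first field;
@code{mktemp} is assumed to be available.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1 apple' '2 banana' '3 cherry' > "$tmp/names"
printf '%s\n' '1 red' '3 dark-red' '4 brown'  > "$tmp/colours"

# Only keys present in both files (1 and 3) produce output lines.
joined=$(LC_ALL=C join "$tmp/names" "$tmp/colours")

printf '%s\n' "$joined"
rm -r "$tmp"
```

Each output line holds the join field, then the rest of the line from
the first file, then the rest of the line from the second.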
2335 The program accepts the following options. Also see @ref{Common options}.
2339 @item -a @var{file-number}
2341 Print a line for each unpairable line in file @var{file-number} (either
2342 @samp{1} or @samp{2}), in addition to the normal output.
2344 @item -e @var{string}
2346 Replace those output fields that are missing in the input with
2350 @itemx --ignore-case
2352 @opindex --ignore-case
2353 Ignore differences in case when comparing keys.
2354 With this option, the lines of the input files must be ordered in the same way.
2355 Use @samp{sort -f} to produce this ordering.
2357 @item -1 @var{field}
2358 @itemx -j1 @var{field}
2361 Join on field @var{field} (a positive integer) of file 1.
2363 @item -2 @var{field}
2364 @itemx -j2 @var{field}
2367 Join on field @var{field} (a positive integer) of file 2.
2369 @item -j @var{field}
2370 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2372 @item -o @var{field-list}@dots{}
2373 Construct each output line according to the format in @var{field-list}.
2374 Each element in @var{field-list} is either the single character @samp{0} or
2375 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
2376 @samp{2} and @var{n} is a positive field number.
2378 A field specification of @samp{0} denotes the join field.
2379 In most cases, the functionality of the @samp{0} field spec
2380 may be reproduced using the explicit @var{m.n} that corresponds
2381 to the join field. However, when printing unpairable lines
2382 (using either of the @samp{-a} or @samp{-v} options), there is no way
2383 to specify the join field using @var{m.n} in @var{field-list}
2384 if there are unpairable lines in both files.
2385 To give @code{join} that functionality, @sc{POSIX} invented the @samp{0}
2386 field specification notation.
2388 The elements in @var{field-list}
2389 are separated by commas or blanks. Multiple @var{field-list}
2390 arguments can be given after a single @samp{-o} option; the values
2391 of all lists given with @samp{-o} are concatenated together.
2392 All output lines, including those printed because of any @samp{-a} or
2393 @samp{-v} option, are subject to the specified @var{field-list}.
2396 Use character @var{char} as the input and output field separator.
2398 @item -v @var{file-number}
2399 Print a line for each unpairable line in file @var{file-number}
2400 (either @samp{1} or @samp{2}), instead of the normal output.
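The following sketch shows how the options for unpairable lines combine
with @samp{-o} and @samp{-e}. The two files and the placeholder string
@samp{NONE} are invented for illustration; @code{mktemp} is assumed to
be available.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1 apple' '2 banana' > "$tmp/f1"
printf '%s\n' '2 yellow' '3 red'   > "$tmp/f2"

# -v 1: print only the lines of file 1 that have no partner in file 2.
unpaired1=$(LC_ALL=C join -v 1 "$tmp/f1" "$tmp/f2")

# -a 1 -a 2: keep unpairable lines from both files as well as the pairs.
# '0' names the join field whichever file the line came from, and
# -e fills in fields that are missing in the input.
full=$(LC_ALL=C join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$tmp/f1" "$tmp/f2")

printf '%s\n' "$unpaired1" "$full"
rm -r "$tmp"
```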
2404 In addition, when GNU @code{join} is invoked with exactly one argument,
2405 options @samp{--help} and @samp{--version} are recognized. @xref{Common
2409 @node Operating on characters
2410 @chapter Operating on characters
2412 @cindex operating on characters
2414 These commands operate on individual characters.
2417 * tr invocation:: Translate, squeeze, and/or delete characters.
2418 * expand invocation:: Convert tabs to spaces.
2419 * unexpand invocation:: Convert spaces to tabs.
2424 @section @code{tr}: Translate, squeeze, and/or delete characters
2431 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
2434 @code{tr} copies standard input to standard output, performing
2435 one of the following operations:
2439 translate, and optionally squeeze repeated characters in the result,
2441 squeeze repeated characters,
2445 delete characters, then squeeze repeated characters from the result.
2448 The @var{set1} and (if given) @var{set2} arguments define ordered
2449 sets of characters, referred to below as @var{set1} and @var{set2}. These
2450 sets are the characters of the input that @code{tr} operates on.
2451 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
2452 complement (all of the characters that are not in @var{set1}).
2455 * Character sets:: Specifying sets of characters.
2456 * Translating:: Changing one character to another.
2457 * Squeezing:: Squeezing repeats and deleting.
2458 * Warnings in tr:: Warning messages.
2462 @node Character sets
2463 @subsection Specifying sets of characters
2465 @cindex specifying sets of characters
2467 The format of the @var{set1} and @var{set2} arguments resembles
2468 the format of regular expressions; however, they are not regular
2469 expressions, only lists of characters. Most characters simply
2470 represent themselves in these strings, but the strings can contain
2471 the shorthands listed below, for convenience. Some of them can be
2472 used only in @var{set1} or @var{set2}, as noted below.
2476 @item Backslash escapes
2477 @cindex backslash escapes
2479 A backslash followed by a character not listed below causes an error
2498 The character with the value given by @var{ooo}, which is 1 to 3
2507 The notation @samp{@var{m}-@var{n}} expands to all of the characters
2508 from @var{m} through @var{n}, in ascending order. @var{m} should
2509 collate before @var{n}; if it doesn't, an error results. As an example,
2510 @samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
2511 does not support the System V syntax that uses square brackets to
2512 enclose ranges, translations specified in that format will still work as
2513 long as the brackets in @var{string1} correspond to identical brackets
2516 @item Repeated characters
2517 @cindex repeated characters
2519 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
2520 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
2521 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
2522 to as many copies of @var{c} as are needed to make @var{set2} as long as
2523 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
2524 octal, otherwise in decimal.
2526 @item Character classes
2527 @cindex character classes
2529 The notation @samp{[:@var{class}:]} expands to all of the characters in
2530 the (predefined) class @var{class}. The characters expand in no
2531 particular order, except for the @code{upper} and @code{lower} classes,
2532 which expand in ascending order. When the @samp{--delete} (@samp{-d})
2533 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
2534 character class can be used in @var{set2}. Otherwise, only the
2535 character classes @code{lower} and @code{upper} are accepted in
2536 @var{set2}, and then only if the corresponding character class
2537 (@code{upper} and @code{lower}, respectively) is specified in the same
2538 relative position in @var{set1}. Doing this specifies case conversion.
2539 The class names are given below; an error results when an invalid class
2551 Horizontal whitespace.
2560 Printable characters, not including space.
2566 Printable characters, including space.
2569 Punctuation characters.
2572 Horizontal or vertical whitespace.
2581 @item Equivalence classes
2582 @cindex equivalence classes
2584 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
2585 equivalent to @var{c}, in no particular order. Equivalence classes are
2586 a relatively recent invention intended to support non-English alphabets.
2587 But there seems to be no standard way to define them or determine their
2588 contents. Therefore, they are not fully implemented in GNU @code{tr};
2589 each character's equivalence class consists only of that character,
2590 which is of no particular use.
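A few of these notations are shown in the sketch below. The input
strings are made up, and the behavior shown is that of GNU @code{tr} as
described in this section; other implementations may differ.

```shell
# A range in set1: delete every lowercase letter.
digits=$(printf 'a1b22c333\n' | tr -d 'a-z')

# [x*3] in set2 is the same as xxx.
masked=$(printf 'abc\n' | tr 'abc' '[x*3]')

# [_*] in set2 repeats _ until set2 is as long as set1.
padded=$(printf 'hello\n' | tr 'a-z' '[_*]')

# Matching lower and upper classes specify case conversion.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$digits" "$masked" "$padded" "$upper"
```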
2596 @subsection Translating
2598 @cindex translating characters
2600 @code{tr} performs translation when @var{set1} and @var{set2} are
2601 both given and the @samp{--delete} (@samp{-d}) option is not given.
2602 @code{tr} translates each character of its input that is in @var{set1}
2603 to the corresponding character in @var{set2}. Characters not in
2604 @var{set1} are passed through unchanged. When a character appears more
2605 than once in @var{set1} and the corresponding characters in @var{set2}
2606 are not all the same, only the final one is used. For example, these
2607 two commands are equivalent:
2614 A common use of @code{tr} is to convert lowercase characters to
2615 uppercase. This can be done in many ways. Here are three of them:
2618 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
2620 tr '[:lower:]' '[:upper:]'
2623 When @code{tr} is performing translation, @var{set1} and @var{set2}
2624 typically have the same length. If @var{set1} is shorter than
2625 @var{set2}, the extra characters at the end of @var{set2} are ignored.
2627 On the other hand, making @var{set1} longer than @var{set2} is not
2628 portable; @sc{POSIX.2} says that the result is undefined. In this situation,
2629 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
2630 the last character of @var{set2} as many times as necessary. System V
2631 @code{tr} truncates @var{set1} to the length of @var{set2}.
2633 By default, GNU @code{tr} handles this case like BSD @code{tr}. When
2634 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
2635 handles this case like the System V @code{tr} instead. This option is
2636 ignored for operations other than translation.
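The difference between the two behaviors can be observed directly. This
sketch assumes GNU @code{tr}; the input string is invented.

```shell
# Default (BSD-like): set2 "xy" is padded to "xyyy", so c and d become y.
padded=$(printf 'abcd\n' | tr abcd xy)

# With -t (System V-like): set1 is truncated to "ab", so c and d pass through.
truncated=$(printf 'abcd\n' | tr -t abcd xy)

printf '%s\n' "$padded" "$truncated"
```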
2638 Acting like System V @code{tr} in this case breaks the relatively common
2642 tr -cs A-Za-z0-9 '\012'
2646 because it converts only zero bytes (the first element in the
2647 complement of @var{set1}), rather than all non-alphanumerics, to
2652 @subsection Squeezing repeats and deleting
2654 @cindex squeezing repeat characters
2655 @cindex deleting characters
2657 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
2658 removes any input characters that are in @var{set1}.
2660 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
2661 @code{tr} replaces each input sequence of a repeated character that
2662 is in @var{set1} with a single occurrence of that character.
2664 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
2665 first performs any deletions using @var{set1}, then squeezes repeats
2666 from any remaining characters using @var{set2}.
2668 The @samp{--squeeze-repeats} option may also be used when translating,
2669 in which case @code{tr} first performs translation, then squeezes
2670 repeats from any remaining characters using @var{set2}.
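The four cases just described can be compared side by side on one
made-up input string; this is a sketch assuming GNU @code{tr}.

```shell
in='aabbccdd'

deleted=$(printf '%s\n' "$in" | tr -d a)        # remove every a
squeezed=$(printf '%s\n' "$in" | tr -s ab)      # squeeze runs of a and of b
both=$(printf '%s\n' "$in" | tr -ds a c)        # delete a, then squeeze runs of c
translated=$(printf '%s\n' "$in" | tr -s ab xy) # a->x, b->y, then squeeze x and y

printf '%s\n' "$deleted" "$squeezed" "$both" "$translated"
```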
2672 Here are some examples to illustrate various combinations of options:
2677 Remove all zero bytes:
2684 Put all words on lines by themselves. This converts all
2685 non-alphanumeric characters to newlines, then squeezes each string
2686 of repeated newlines into a single newline:
2689 tr -cs 'a-zA-Z0-9' '[\n*]'
2693 Convert each sequence of repeated newlines to a single newline:
2702 @node Warnings in tr
2703 @subsection Warning messages
2705 @vindex POSIXLY_CORRECT
2706 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
2707 following warning and error messages, for strict compliance with
2708 @sc{POSIX.2}. Otherwise, the following diagnostics are issued:
2713 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
2714 is not, and @var{set2} is given, GNU @code{tr} by default prints
2715 a usage message and exits, because @var{set2} would not be used.
2716 The @sc{POSIX} specification says that @var{set2} must be ignored in
2717 this case. Silently ignoring arguments is a bad idea.
2720 When an ambiguous octal escape is given. For example, @samp{\400}
2721 is actually @samp{\40} followed by the digit @samp{0}, because the
2722 value 400 octal does not fit into a single byte.
2726 GNU @code{tr} does not provide complete BSD or System V compatibility.
2727 For example, it is impossible to disable interpretation of the @sc{POSIX}
2728 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
2729 @code{tr} does not delete zero bytes automatically, unlike traditional
2730 Unix versions, which provide no way to preserve zero bytes.
2733 @node expand invocation
2734 @section @code{expand}: Convert tabs to spaces
2737 @cindex tabs to spaces, converting
2738 @cindex converting tabs to spaces
2740 @code{expand} writes the contents of each given @var{file}, or standard
2741 input if none are given or for a @var{file} of @samp{-}, to standard
2742 output, with tab characters converted to the appropriate number of
2746 expand [@var{option}]@dots{} [@var{file}]@dots{}
2749 By default, @code{expand} converts all tabs to spaces. It preserves
2750 backspace characters in the output; they decrement the column count for
2751 tab calculations. The default action is equivalent to @samp{-8} (set
2752 tabs every 8 columns).
2754 The program accepts the following options. Also see @ref{Common options}.
2758 @item -@var{tab1}[,@var{tab2}]@dots{}
2759 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2760 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2764 @cindex tabstops, setting
2765 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2766 (default is 8). Otherwise, set the tabs at columns @var{tab1},
2767 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
2768 last tabstop given with single spaces. If the tabstops are specified
2769 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2770 blanks as well as by commas.
2776 @cindex initial tabs, converting
2777 Only convert initial tabs (those that precede all non-space, non-tab
2778 characters) on each line to spaces.
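A brief sketch of the tab-stop and initial-tab behavior, using made-up
input lines built with @code{printf}:

```shell
# Default tab stops every 8 columns: the tab becomes 7 spaces after "a".
eight=$(printf 'a\tb\n' | expand)

# Tab stops every 4 columns: the tab becomes 3 spaces.
four=$(printf 'a\tb\n' | expand -t 4)

# -i converts only the leading tab; the tab after "a" is left alone.
initial=$(printf '\ta\tb\n' | expand -i -t 4)

printf '[%s]\n' "$eight" "$four" "$initial"
```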
2783 @node unexpand invocation
2784 @section @code{unexpand}: Convert spaces to tabs
2788 @code{unexpand} writes the contents of each given @var{file}, or
2789 standard input if none are given or for a @var{file} of @samp{-}, to
2790 standard output, with strings of two or more space or tab characters
2791 converted to as many tabs as possible followed by as many spaces as are
2795 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
2798 By default, @code{unexpand} converts only initial spaces and tabs (those
2799 that precede all non-space, non-tab characters) on each line. It
2800 preserves backspace characters in the output; they decrement the column
2801 count for tab calculations. By default, tabs are set at every 8th
2804 The program accepts the following options. Also see @ref{Common options}.
2808 @item -@var{tab1}[,@var{tab2}]@dots{}
2809 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
2810 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
2814 If only one tab stop is given, set the tabs @var{tab1} spaces apart
2815 instead of the default 8. Otherwise, set the tabs at columns
2816 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
2817 tabs beyond the tabstops given unchanged. If the tabstops are specified
2818 with the @samp{-t} or @samp{--tabs} option, they can be separated by
2819 blanks as well as by commas. This option implies the @samp{-a} option.
2825 Convert all strings of two or more spaces or tabs, not just initial
2833 @node Opening the software toolbox
2834 @chapter Opening the software toolbox
2836 This chapter originally appeared in @cite{Linux Journal}, volume 1,
2837 number 2, in the @cite{What's GNU?} column. It was written by Arnold
2841 * Toolbox introduction::
2845 * The sort command::
2846 * The uniq command::
2847 * Putting the tools together::
2851 @node Toolbox introduction
2852 @unnumberedsec Toolbox introduction
2854 This month's column is only peripherally related to the GNU Project, in
2855 that it describes a number of the GNU tools on your Linux system and how they
2856 might be used. What it's really about is the ``Software Tools'' philosophy
2857 of program development and usage.
2859 The software tools philosophy was an important and integral concept
2860 in the initial design and development of Unix (of which Linux and GNU are
2861 essentially clones). Unfortunately, in the modern day press of
2862 Internetworking and flashy GUIs, it seems to have fallen by the
2863 wayside. This is a shame, since it provides a powerful mental model
2864 for solving many kinds of problems.
2866 Many people carry a Swiss Army knife around in their pants pockets (or
2867 purse). A Swiss Army knife is a handy tool to have: it has several knife
2868 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
2869 a number of other things on it. For the everyday, small miscellaneous jobs
2870 where you need a simple, general purpose tool, it's just the thing.
2872 On the other hand, an experienced carpenter doesn't build a house using
2873 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
2874 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
2875 exactly when and where to use each tool; you won't catch him hammering nails
2876 with the handle of his screwdriver.
2878 The Unix developers at Bell Labs were all professional programmers and trained
2879 computer scientists. They had found that while a one-size-fits-all program
2880 might appeal to a user because there's only one program to use, in practice
2888 difficult to maintain and
2892 difficult to extend to meet new situations.
2895 Instead, they felt that programs should be specialized tools. In short, each
2896 program ``should do one thing well.'' No more and no less. Such programs are
2897 simpler to design, write, and get right---they only do one thing.
2899 Furthermore, they found that with the right machinery for hooking programs
2900 together, the whole was greater than the sum of the parts. By combining
2901 several special purpose programs, you could accomplish a specific task
2902 that none of the programs was designed for, and accomplish it much more
2903 quickly and easily than if you had to write a special purpose program.
2904 We will see some (classic) examples of this further on in the column.
2905 (An important additional point was that, if necessary, you should take a
2906 detour and build any software tools you may need first, if you don't already
2907 have something appropriate in the toolbox.)
2909 @node I/O redirection
2910 @unnumberedsec I/O redirection
2912 Hopefully, you are familiar with the basics of I/O redirection in the
2913 shell, in particular the concepts of ``standard input,'' ``standard output,''
2914 and ``standard error''. Briefly, ``standard input'' is a data source, where
2915 data comes from. A program should not need to either know or care if the
2916 data source is a disk file, a keyboard, a magnetic tape, or even a punched
2917 card reader. Similarly, ``standard output'' is a data sink, where data goes
2918 to. The program should neither know nor care where this might be.
2919 Programs that only read their standard input, do something to the data,
2920 and then send it on, are called ``filters'', by analogy to filters in a
2923 With the Unix shell, it's very easy to set up data pipelines:
2926 program_to_create_data | filter1 | .... | filterN > final.pretty.data
2929 We start out by creating the raw data; each filter applies some successive
2930 transformation to the data, until by the time it comes out of the pipeline,
2931 it is in the desired form.
2933 This is fine and good for standard input and standard output. Where does the
2934 standard error come into play? Well, think about @code{filter1} in
2935 the pipeline above. What happens if it encounters an error in the data it
2936 sees? If it writes an error message to standard output, it will just
2937 disappear down the pipeline into @code{filter2}'s input, and the
2938 user will probably never see it. So programs need a place where they can send
2939 error messages so that the user will notice them. This is standard error,
2940 and it is usually connected to your console or window, even if you have
2941 redirected standard output of your program away from your screen.
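Here is a small sketch of that idea. The @code{filter1} function below
is a made-up stand-in, not a real program, and I send its standard error
to a temporary file only so that it can be inspected afterwards;
normally it would simply show up on your screen.

```shell
# Hypothetical filter: passes lines through, complains on stderr about blanks.
filter1() {
    while IFS= read -r line; do
        if [ -z "$line" ]; then
            echo 'filter1: empty line skipped' >&2
        else
            printf '%s\n' "$line"
        fi
    done
}

# The diagnostic bypasses the pipe; only real data reaches sort.
errlog=$(mktemp)
result=$(printf 'b\n\na\n' | filter1 2>"$errlog" | LC_ALL=C sort)
diag=$(cat "$errlog")
rm -f "$errlog"

printf 'data:\n%s\n' "$result"
printf 'diagnostics: %s\n' "$diag"
```

Had @code{filter1} written its complaint to standard output instead,
the message would have been sorted in with the data.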
2943 For filter programs to work together, the format of the data has to be
2944 agreed upon. The most straightforward and easiest format to use is simply
2945 lines of text. Unix data files are generally just streams of bytes, with
2946 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
2947 conventionally called a ``newline'' in the Unix literature. (This is
2948 @code{'\n'} if you're a C programmer.) This is the format used by all
2949 the traditional filtering programs. (Many earlier operating systems
2950 had elaborate facilities and special purpose programs for managing
2951 binary data. Unix has always shied away from such things, under the
2952 philosophy that it's easiest to simply be able to view and edit your
2953 data with a text editor.)
2955 OK, enough introduction. Let's take a look at some of the tools, and then
2956 we'll see how to hook them together in interesting ways. In the following
2957 discussion, we will only present those command line options that interest
2958 us. As you should always do, double check your system documentation
2961 @node The who command
2962 @unnumberedsec The @code{who} command
2964 The first program is the @code{who} command. By itself, it generates a
2965 list of the users who are currently logged in. Although I'm writing
2966 this on a single-user system, we'll pretend that several people are
2971 arnold console Jan 22 19:57
2972 miriam ttyp0 Jan 23 14:19(:0.0)
2973 bill ttyp1 Jan 21 09:32(:0.0)
2974 arnold ttyp2 Jan 23 20:48(:0.0)
2977 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
2978 There are three people logged in, and I am logged in twice. On traditional
2979 Unix systems, user names are never more than eight characters long. This
2980 little bit of trivia will be useful later. The output of @code{who} is nice,
2981 but the data is not all that exciting.
2983 @node The cut command
2984 @unnumberedsec The @code{cut} command
2986 The next program we'll look at is the @code{cut} command. This program
2987 cuts out columns or fields of input data. For example, we can tell it
2988 to print just the login name and full name from the @file{/etc/passwd}
2989 file. The @file{/etc/passwd} file has seven fields, separated by
2993 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
2996 To get the first and fifth fields, we would use @code{cut} like this:
2999 $ cut -d: -f1,5 /etc/passwd
3002 arnold:Arnold D. Robbins
3003 miriam:Miriam A. Robbins
3007 With the @samp{-c} option, @code{cut} will cut out specific characters
3008 (i.e., columns) in the input lines. This command looks like it might be
3009 useful for data filtering.
3012 @node The sort command
3013 @unnumberedsec The @code{sort} command
3015 Next we'll look at the @code{sort} command. This is one of the most
3016 powerful commands on a Unix-style system; one that you will often find
3017 yourself using when setting up fancy data plumbing. The @code{sort}
3018 command reads and sorts each file named on the command line. It then
3019 merges the sorted data and writes it to standard output. It will read
3020 standard input if no files are given on the command line (thus
3021 making it into a filter). The sort is based on the machine collating
3022 sequence (@sc{ASCII}) or based on user-supplied ordering criteria.
3025 @node The uniq command
3026 @unnumberedsec The @code{uniq} command
3028 Finally (at least for now), we'll look at the @code{uniq} program. When
3029 sorting data, you will often end up with duplicate lines, lines that
3030 are identical. Usually, all you need is one instance of each line.
3031 This is where @code{uniq} comes in. The @code{uniq} program reads its
3032 standard input, which it expects to be sorted. It only prints out one
3033 copy of each duplicated line. It does have several options. Later on,
3034 we'll use the @samp{-c} option, which prints each unique line, preceded
3035 by a count of the number of times that line occurred in the input.
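As a quick sketch of what @samp{-c} produces, here is a handful of
made-up words run through @code{sort} and @code{uniq -c}. The exact
width of the padding in front of each count may vary between
implementations.

```shell
# Five words, three of them the same.
counts=$(printf '%s\n' the cat the hat the | LC_ALL=C sort | uniq -c)

# Each output line is a count followed by the line it counts.
printf '%s\n' "$counts"
```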
3038 @node Putting the tools together
3039 @unnumberedsec Putting the tools together
3041 Now, let's suppose this is a large BBS system with dozens of users
3042 logged in. The management wants the SysOp to write a program that will
3043 generate a sorted list of logged in users. Furthermore, even if a user
3044 is logged in multiple times, his or her name should only show up in the
3047 The SysOp could sit down with the system documentation and write a C
3048 program that did this. It would take perhaps a couple of hundred lines
3049 of code and about two hours to write it, test it, and debug it.
3050 However, knowing the software toolbox, the SysOp can instead start out
3051 by generating just a list of logged on users:
3061 Next, sort the list:
3064 $ who | cut -c1-8 | sort
3071 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
3074 $ who | cut -c1-8 | sort | uniq
3080 The @code{sort} command actually has a @samp{-u} option that does what
3081 @code{uniq} does. However, @code{uniq} has other uses for which one
3082 cannot substitute @samp{sort -u}.
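Since the output of @code{who} depends on who happens to be logged in,
here is a sketch of the same pipeline fed from canned data. The
@code{fake_who} function is a made-up stand-in modelled on the listing
shown earlier; it is not a real command.

```shell
# Canned stand-in for `who`, with one user logged in twice.
fake_who() {
    printf '%s\n' \
        'arnold   console Jan 22 19:57' \
        'miriam   ttyp0   Jan 23 14:19' \
        'bill     ttyp1   Jan 21 09:32' \
        'arnold   ttyp2   Jan 23 20:48'
}

# Keep the first eight columns, sort them, and drop the duplicate.
users=$(fake_who | cut -c1-8 | LC_ALL=C sort | uniq)
printf '%s\n' "$users"
```

Note that @code{cut -c1-8} keeps the blanks that pad each name out to
eight columns, so the names in the result carry trailing spaces.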
3084 The SysOp puts this pipeline into a shell script, and makes it available for
3085 all the users on the system:
3088 # cat > /usr/local/bin/listusers
3089 who | cut -c1-8 | sort | uniq
3091 # chmod +x /usr/local/bin/listusers
3094 There are four major points to note here. First, with just four
3095 programs, on one command line, the SysOp was able to save about two
3096 hours' worth of work. Furthermore, the shell pipeline is just about as
3097 efficient as the C program would be, and it is much more efficient in
3098 terms of programmer time. People time is much more expensive than
3099 computer time, and in our modern ``there's never enough time to do
3100 everything'' society, saving two hours of programmer time is no mean
3103 Second, it is also important to emphasize that with the
3104 @emph{combination} of the tools, it is possible to do a special
3105 purpose job never imagined by the authors of the individual programs.
3107 Third, it is also valuable to build up your pipeline in stages, as we did here.
3108 This allows you to view the data at each stage in the pipeline, which helps
3109 you acquire the confidence that you are indeed using these tools correctly.
3111 Finally, by bundling the pipeline in a shell script, other users can use
3112 your command, without having to remember the fancy plumbing you set up for
3113 them. In terms of how you run them, shell scripts and compiled programs are
3116 After the previous warm-up exercise, we'll look at two additional, more
3117 complicated pipelines. For them, we need to introduce two more tools.
3119 The first is the @code{tr} command, which stands for ``transliterate.''
3120 The @code{tr} command works on a character-by-character basis, changing
3121 characters. Normally it is used for things like mapping upper case to lower case:
3125 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
3126 this example has mixed case!
3129 There are several options of interest:
3133 @samp{-c}: work on the complement of the listed characters, i.e.,
3134 operations apply to characters not in the given set
3137 @samp{-d}: delete characters in the first set from the output
3140 @samp{-s}: squeeze repeated characters in the output into just one character.
3143 We will be using all three options in a moment.
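As a quick, hedged illustration of the three options on their own (the input strings are invented, and the ranges are written without the surrounding brackets, which POSIX @code{tr} does not need):

```shell
# -d deletes every character that appears in the set.
no_digits=$(echo 'a1b2c3' | tr -d '0-9')

# -s squeezes each run of a listed character down to a single one.
squeezed=$(echo 'aaa   bbb' | tr -s ' ')

# -c complements the set; combined with -d, everything NOT listed is deleted.
# The newline (\012) is listed so that it survives.
letters_only=$(echo 'Hello, World 42!' | tr -cd 'A-Za-z\012')

printf '%s\n' "$no_digits" "$squeezed" "$letters_only"
```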
3145 The other command we'll look at is @code{comm}. The @code{comm}
3146 command takes two sorted files as input, and prints out the
3147 files' lines in three columns. The output columns are the data lines
3148 unique to the first file, the data lines unique to the second file, and
3149 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
3150 @samp{-3} command line options omit the respective columns. (This is
3151 non-intuitive and takes a little getting used to.) For example:
3173 The single dash as a filename tells @code{comm} to read standard input
3174 instead of a regular file.
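The column options are easier to absorb with small files. The sketch below assumes two invented, already sorted files in a scratch directory made with @code{mktemp -d} (widely available, though not required by POSIX); the file names @file{first} and @file{second} are purely illustrative.

```shell
# Hypothetical scratch files; both must already be sorted.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/first"
printf '%s\n' banana cherry date > "$dir/second"

# Suppress columns 2 and 3: what is left is unique to the first file.
only_first=$(LC_ALL=C comm -23 "$dir/first" "$dir/second")

# Suppress columns 1 and 2: lines common to both files.
# Here the first file is read from standard input via the single dash.
in_both=$(LC_ALL=C comm -12 - "$dir/second" < "$dir/first")

# Suppress columns 1 and 3: what is left is unique to the second file.
only_second=$(LC_ALL=C comm -13 "$dir/first" "$dir/second")

rm -r "$dir"
printf '%s\n' "$only_first" "$in_both" "$only_second"
```

Remember that the options name the columns to @emph{omit}, so @samp{-23} keeps column one.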
3176 Now we're ready to build a fancy pipeline. The first application is a word
3177 frequency counter. This helps an author determine if he or she is over-using certain words.
3180 The first step is to change the case of all the letters in our input file
3181 to one case. ``The'' and ``the'' are the same word when doing counting.
3184 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
3187 The next step is to get rid of punctuation. Quoted words and unquoted words
3188 should be treated identically; it's easiest to just get the punctuation out of the way.
3192 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
3195 The second @code{tr} command operates on the complement of the listed
3196 characters, which are all the letters, the digits, the underscore, and
3197 the blank. The @samp{\012} represents the newline character; it has to
3198 be left alone. (The ASCII TAB character should also be included for
3199 good measure in a production script.)
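A sketch of what including the TAB might look like, assuming the octal escape @samp{\011} for TAB and an invented input line. If tabs are kept at this step, whatever step later splits the text into words has to treat them as separators too, so that is shown as well.

```shell
# Keep letters, digits, underscore, blank, TAB (\011) and newline (\012);
# delete everything else.
cleaned=$(printf 'say "hi",\tworld!\n' | tr -cd 'A-Za-z0-9_ \011\012')

# Turn both blanks and tabs into newlines, squeezing repeated newlines.
split=$(printf '%s\n' "$cleaned" | tr -s ' \011' '\012\012')

printf '%s\n' "$split"
```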
3201 At this point, we have data consisting of words separated by blank space.
3202 The words only contain alphanumeric characters (and the underscore). The
3203 next step is to break the data apart so that we have one word per line. This
3204 makes the counting operation much easier, as we will see shortly.
3207 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3208 > tr -s '[ ]' '\012' | ...
3211 This command turns blanks into newlines. The @samp{-s} option squeezes
3212 multiple newline characters in the output into just one. This helps us
3213 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
3214 This is what the shell prints when it notices you haven't finished
3215 typing in all of a command.)
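Here is the splitting step in isolation, on an invented input with runs of blanks and an empty line. The brackets around the blank are left out in this sketch, since a one-character set does not need them.

```shell
# Blanks become newlines; -s then squeezes every run of newlines into one,
# which also swallows the empty line that was already in the input.
words=$(printf 'the  quick   brown fox\n\nthe end\n' | tr -s ' ' '\012')

printf '%s\n' "$words"
```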
3217 We now have data consisting of one word per line, no punctuation, all one
3218 case. We're ready to count each word:
3221 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3222 > tr -s '[ ]' '\012' | sort | uniq -c | ...
3225 At this point, the data might look something like this:
3238 The output is sorted by word, not by count! What we want is the most
3239 frequently used words first. Fortunately, this is easy to accomplish,
3240 with the help of two more @code{sort} options:
3244 @samp{-n}: do a numeric sort, not an ASCII one
3247 @samp{-r}: reverse the order of the sort
3250 The final pipeline looks like this:
3253 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3254 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
3263 Whew! That's a lot to digest. Yet, the same principles apply. With six
3264 commands, on two lines (really one long one split for convenience), we've
3265 created a program that does something interesting and useful, in much
3266 less time than we could have written a C program to do the same thing.
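To try the whole pipeline without a copy of @file{whats.gnu}, one can feed it a short invented sentence. This sketch writes the ranges without brackets and sets @code{LC_ALL=C} on the two @code{sort} commands for a predictable order; neither change is essential.

```shell
# Hypothetical input text.
text='The cat saw the dog. The dog, "Rex", saw the cat; the end!'

freq=$(printf '%s\n' "$text" |
  tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \012' |
  tr -s ' ' '\012' | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr)

# The first line of the result is the most frequently used word.
top=$(printf '%s\n' "$freq" | head -n 1)

printf '%s\n' "$freq"
```

With this input, ``the'' appears five times and comes out on top.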
3268 A minor modification to the above pipeline can give us a simple spelling
3269 checker! To determine if you've spelled a word correctly, all you have to
3270 do is look it up in a dictionary. If it is not there, then chances are
3271 that your spelling is incorrect. So, we need a dictionary. If you
3272 have the Slackware Linux distribution, you have the file
3273 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400-word dictionary.
3276 Now, how to compare our file with the dictionary? As before, we generate
3277 a sorted list of words, one per line:
3280 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3281 > tr -s '[ ]' '\012' | sort -u | ...
3284 Now, all we need is a list of words that are @emph{not} in the
3285 dictionary. Here is where the @code{comm} command comes in.
3288 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3289 > tr -s '[ ]' '\012' | sort -u |
3290 > comm -23 - /usr/lib/ispell/ispell.words
3293 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3294 dictionary (the second file), and lines that are in both files. Lines
3295 only in the first file (standard input, our stream of words), are
3296 words that are not in the dictionary. These are likely candidates for
3297 spelling errors. This pipeline was the first cut at a production
3298 spelling checker on Unix.
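The same idea can be tried on systems without @file{ispell.words}. The sketch below assumes a five-word dictionary written to a scratch file (created with @code{mktemp -d}) and an invented sentence with two deliberate typos. Both inputs to @code{comm} are sorted under @code{LC_ALL=C} so that they agree on the ordering.

```shell
dir=$(mktemp -d)

# Hypothetical miniature dictionary, sorted the same way as the word stream.
printf '%s\n' brown fox jumps quick the | LC_ALL=C sort > "$dir/dict"

suspects=$(printf 'The quick brwon fox jumsp.\n' |
  tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \012' |
  tr -s ' ' '\012' | LC_ALL=C sort -u |
  LC_ALL=C comm -23 - "$dir/dict")

rm -r "$dir"
printf '%s\n' "$suspects"
```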
3300 There are some other tools that deserve brief mention.
3304 @code{grep}: search files for text that matches a regular expression
3307 @code{egrep}: like @code{grep}, but with more powerful regular expressions
3310 @code{wc}: count lines, words, characters
3313 @code{tee}: a T-fitting for data pipes, copies data to files and to standard output
3316 @code{sed}: the stream editor, an advanced tool
3319 @code{awk}: a data manipulation language, another advanced tool
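A brief, hedged taste of three of these tools working together, using invented data and a scratch file. @samp{grep -E} is used here for the extended regular expressions.

```shell
dir=$(mktemp -d)

# tee saves a copy of the stream in a file while passing it down the pipe.
matches=$(printf '%s\n' apple banana cherry | tee "$dir/copy" | grep 'an')

# Extended regular expressions allow alternation with the | operator.
either=$(grep -E '^(apple|cherry)$' "$dir/copy")

# wc -l counts lines; tr strips the padding that some versions of wc add.
lines=$(wc -l < "$dir/copy" | tr -d ' ')

rm -r "$dir"
printf '%s\n' "$matches" "$either" "$lines"
```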
3322 The software tools philosophy also espoused the following bit of
3323 advice: ``Let someone else do the hard part.'' This means, take
3324 something that gives you most of what you need, and then massage it the
3325 rest of the way until it's in the form that you want.
3331 Each program should do one thing well. No more, no less.
3334 Combining programs with appropriate plumbing leads to results where
3335 the whole is greater than the sum of the parts. It also leads to novel
3336 uses of programs that the authors might never have imagined.
3339 Programs should never print extraneous header or trailer data, since these
3340 could get sent on down a pipeline. (A point we didn't mention earlier.)
3343 Let someone else do the hard part.
3346 Know your toolbox! Use each program appropriately. If you don't have an
3347 appropriate tool, build one.
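The point about header lines can be seen with a small, invented listing: a heading that travels down the pipe gets counted as if it were data, and the consumer has to strip it off again.

```shell
# Hypothetical listing whose first line is a heading, not data.
listing=$(printf '%s\n' 'USER     TTY' 'arnold   tty1' 'bill     tty2')

# Counting lines directly includes the heading: one too many.
wrong=$(printf '%s\n' "$listing" | wc -l | tr -d ' ')

# The consumer has to know about the heading and skip it first.
right=$(printf '%s\n' "$listing" | tail -n +2 | wc -l | tr -d ' ')

printf '%s\n' "with header: $wrong" "without header: $right"
```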
3350 As of this writing, all the programs we've discussed are available via
3351 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
3352 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
3353 current when this column was written. Check the nearest GNU archive for
3354 the current version.}
3356 None of what I have presented in this column is new. The Software Tools
3357 philosophy was first introduced in the book @cite{Software Tools},
3358 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
3359 0-201-03669-X). This book showed how to write and use software
3360 tools. It was written in 1976, using a preprocessor for FORTRAN named
3361 @code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
3362 as it is now; FORTRAN was. The last chapter presented a @code{ratfor}
3363 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
3364 awful lot like C; if you know C, you won't have any problem following the code.
3367 In 1981, the book was updated and made available as @cite{Software
3368 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
3369 remain in print, and are well worth reading if you're a programmer.
3370 They certainly made a major change in how I view programming.
3372 Initially, the programs in both books were available (on 9-track tape)
3373 from Addison-Wesley. Unfortunately, this is no longer the case,
3374 although you might be able to find copies floating around the Internet.
3375 For a number of years, there was an active Software Tools Users Group,
3376 whose members had ported the original @code{ratfor} programs to essentially
3377 every computer system with a FORTRAN compiler. The popularity of the
3378 group waned in the middle '80s as Unix began to spread beyond universities.
3380 With the current proliferation of GNU code and other clones of Unix programs,
3381 these programs now receive little attention; modern C versions are
3382 much more efficient and do more than these programs do. Nevertheless, as
3383 exposition of good programming style, and evangelism for a still-valuable
3384 philosophy, these books are unparalleled, and I recommend them highly.
3386 Acknowledgment: I would like to express my gratitude to Brian Kernighan
3387 of Bell Labs, the original Software Toolsmith, for reviewing this column.
3399 @c texinfo-column-for-description: 32