3 @setfilename textutils.info
4 @settitle GNU text utilities
12 @c Put everything in one index (arbitrarily chosen to be the concept index).
22 * Text utilities: (textutils). GNU text utilities.
23 * cat: (textutils)cat invocation. Concatenate and write files.
24 * cksum: (textutils)cksum invocation. Print @sc{POSIX} CRC checksum.
25 * comm: (textutils)comm invocation. Compare sorted files by line.
26 * csplit: (textutils)csplit invocation. Split by context.
27 * cut: (textutils)cut invocation. Print selected parts of lines.
28 * expand: (textutils)expand invocation. Convert tabs to spaces.
29 * fmt: (textutils)fmt invocation. Reformat paragraph text.
30 * fold: (textutils)fold invocation. Wrap long input lines.
31 * head: (textutils)head invocation. Output the first part of files.
32 * join: (textutils)join invocation. Join lines on a common field.
33 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
34 * nl: (textutils)nl invocation. Number lines and write files.
35 * od: (textutils)od invocation. Dump files in octal, etc.
36 * paste: (textutils)paste invocation. Merge lines of files.
37 * pr: (textutils)pr invocation. Paginate or columnate files.
38 * ptx: (textutils)ptx invocation. Produce permuted indexes.
39 * sort: (textutils)sort invocation. Sort text files.
40 * split: (textutils)split invocation. Split into fixed-size pieces.
41 * sum: (textutils)sum invocation. Print traditional checksum.
42 * tac: (textutils)tac invocation. Reverse files.
43 * tail: (textutils)tail invocation. Output the last part of files.
44 * tr: (textutils)tr invocation. Translate characters.
45 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
46 * uniq: (textutils)uniq invocation. Uniqify files.
47 * wc: (textutils)wc invocation. Byte, word, and line counts.
53 This file documents the GNU text utilities.
55 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
57 Permission is granted to make and distribute verbatim copies of
58 this manual provided the copyright notice and this permission notice
59 are preserved on all copies.
62 Permission is granted to process this file through TeX and print the
63 results, provided the printed document carries copying permission
64 notice identical to this one except for the removal of this paragraph
65 (this paragraph not being relevant to the printed manual).
68 Permission is granted to copy and distribute modified versions of this
69 manual under the conditions for verbatim copying, provided that the entire
70 resulting derived work is distributed under the terms of a permission
71 notice identical to this one.
73 Permission is granted to copy and distribute translations of this manual
74 into another language, under the above conditions for modified versions,
75 except that this permission notice may be stated in a translation approved
80 @title GNU @code{textutils}
81 @subtitle A set of text utilities
82 @subtitle for version @value{VERSION}, @value{UPDATED}
83 @author David MacKenzie et al.
86 @vskip 0pt plus 1filll
87 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
89 Permission is granted to make and distribute verbatim copies of
90 this manual provided the copyright notice and this permission notice
91 are preserved on all copies.
93 Permission is granted to copy and distribute modified versions of this
94 manual under the conditions for verbatim copying, provided that the entire
95 resulting derived work is distributed under the terms of a permission
96 notice identical to this one.
98 Permission is granted to copy and distribute translations of this manual
99 into another language, under the above conditions for modified versions,
100 except that this permission notice may be stated in a translation approved
107 @top GNU text utilities
109 @cindex text utilities
110 @cindex utilities for text handling
112 This manual documents version @value{VERSION} of the GNU text utilities.
115 * Introduction:: Caveats, overview, and authors.
116 * Common options:: Common options.
117 * Output of entire files:: cat tac nl od
118 * Formatting file contents:: fmt pr fold
119 * Output of parts of files:: head tail split csplit
120 * Summarizing files:: wc sum cksum md5sum
121 * Operating on sorted files:: sort uniq comm ptx
122 * Operating on fields within a line:: cut paste join
123 * Operating on characters:: tr expand unexpand
124 * Opening the software toolbox:: The software tools philosophy.
125 * Index:: General index.
128 --- The Detailed Node Listing ---
130 Output of entire files
132 * cat invocation:: Concatenate and write files.
133 * tac invocation:: Concatenate and write files in reverse.
134 * nl invocation:: Number lines and write files.
135 * od invocation:: Write files in octal or other formats.
137 Formatting file contents
139 * fmt invocation:: Reformat paragraph text.
140 * pr invocation:: Paginate or columnate files for printing.
141 * fold invocation:: Wrap input lines to fit in specified width.
143 Output of parts of files
145 * head invocation:: Output the first part of files.
146 * tail invocation:: Output the last part of files.
147 * split invocation:: Split a file into fixed-size pieces.
148 * csplit invocation:: Split a file into context-determined pieces.
152 * wc invocation:: Print byte, word, and line counts.
153 * sum invocation:: Print checksum and block counts.
154 * cksum invocation:: Print CRC checksum and byte counts.
155 * md5sum invocation:: Print or check message-digests.
157 Operating on sorted files
159 * sort invocation:: Sort text files.
160 * uniq invocation:: Uniqify files.
161 * comm invocation:: Compare two sorted files line by line.
162 * ptx invocation:: Produce a permuted index of file contents.
164 @code{ptx}: Produce permuted indexes
166 * General options in ptx:: Options which affect general program behaviour.
167 * Charset selection in ptx:: Underlying character set considerations.
168 * Input processing in ptx:: Input fields, contexts, and keyword selection.
169 * Output formatting in ptx:: Types of output format, and sizing the fields.
170 * Compatibility in ptx:: The GNU extensions to @code{ptx}.
172 Operating on fields within a line
174 * cut invocation:: Print selected parts of lines.
175 * paste invocation:: Merge lines of files.
176 * join invocation:: Join lines on a common field.
178 Operating on characters
180 * tr invocation:: Translate, squeeze, and/or delete characters.
181 * expand invocation:: Convert tabs to spaces.
182 * unexpand invocation:: Convert spaces to tabs.
184 @code{tr}: Translate, squeeze, and/or delete characters
186 * Character sets:: Specifying sets of characters.
187 * Translating:: Changing one character to another.
188 * Squeezing:: Squeezing repeats and deleting.
189 * Warnings in tr:: Warning messages.
191 Opening the software toolbox
193 * Toolbox introduction:: Toolbox introduction
194 * I/O redirection:: I/O redirection
195 * The who command:: The @code{who} command
196 * The cut command:: The @code{cut} command
197 * The sort command:: The @code{sort} command
198 * The uniq command:: The @code{uniq} command
199 * Putting the tools together:: Putting the tools together
208 @chapter Introduction
212 This manual is incomplete: No attempt is made to explain basic concepts
213 in a way suitable for novices. Thus, if you are interested, please get
214 involved in improving this manual. The entire GNU community will
218 The GNU text utilities are mostly compatible with the @sc{POSIX.2} standard.
220 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
221 @c sh-utils.texi too -- so be sure to keep them consistent.
222 @cindex bugs, reporting
223 Please report bugs to @email{bug-textutils@@gnu.org}. Remember
224 to include the version number, machine architecture, input files, and
225 any other information needed to reproduce the bug: your input, what you
226 expected, what you got, and why it is wrong. Diffs are welcome, but
227 please include a description of the problem as well, since this is
228 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
230 This manual was originally derived from the Unix man pages in the
231 distribution, which were written by David MacKenzie and updated by Jim
232 Meyering. What you are reading now is the authoritative documentation
233 for these utilities; the man pages are no longer being maintained.
234 The original @code{fmt} man page was written by Ross Paterson.
235 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
236 Karl Berry did the indexing, some reorganization, and editing of the results.
237 Richard Stallman contributed his usual invaluable insights to the
242 @chapter Common options
244 @cindex common options
246 Certain options are available in all these programs. Rather than
247 repeating identical descriptions for each program, this manual
248 describes them once, here. (In fact, every GNU program accepts (or should accept)
251 A few of these programs take arbitrary strings as arguments. In those
252 cases, @samp{--help} and @samp{--version} are taken as these options
253 only if there is one and exactly one command line argument.
260 Print a usage message listing all available options, then exit successfully.
264 @cindex version number, finding
265 Print the version number, then exit successfully.
270 @node Output of entire files
271 @chapter Output of entire files
273 @cindex output of entire files
274 @cindex entire files, output of
276 These commands read and write entire files, possibly transforming them
280 * cat invocation:: Concatenate and write files.
281 * tac invocation:: Concatenate and write files in reverse.
282 * nl invocation:: Number lines and write files.
283 * od invocation:: Write files in octal or other formats.
287 @section @code{cat}: Concatenate and write files
290 @cindex concatenate and write files
291 @cindex copying files
293 @code{cat} copies each @var{file} (@samp{-} means standard input), or
294 standard input if none are given, to standard output. Synopsis:
297 cat [@var{option}]@dots{} [@var{file}]@dots{}
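
As a rough illustration of the synopsis, the following sketch feeds made-up
sample text to @code{cat} on standard input. It assumes a POSIX shell and
the GNU implementation of @code{cat}; the variable names are only for
illustration.

```shell
# Squeeze runs of blank lines into a single blank line (-s).
squeezed=$(printf 'a\n\n\n\nb\n' | cat -s)
printf '%s\n' "$squeezed"

# Number every output line, starting with 1 (-n).
numbered=$(printf 'x\ny\n' | cat -n)
printf '%s\n' "$numbered"

# Show tabs as ^I and mark each line end with $ (-A is -vET).
visible=$(printf 'a\tb\n' | cat -A)
printf '%s\n' "$visible"
```

Each option used here is described in the list of options for this program.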
300 The program accepts the following options. Also see @ref{Common options}.
308 Equivalent to @samp{-vET}.
311 @itemx --number-nonblank
313 @opindex --number-nonblank
314 Number all nonblank output lines, starting with 1.
318 Equivalent to @samp{-vE}.
324 Display a @samp{$} after the end of each line.
330 Number all output lines, starting with 1.
333 @itemx --squeeze-blank
335 @opindex --squeeze-blank
336 @cindex squeezing blank lines
337 Replace multiple adjacent blank lines with a single blank line.
341 Equivalent to @samp{-vT}.
347 Display @key{TAB} characters as @samp{^I}.
351 Ignored; for Unix compatibility.
354 @itemx --show-nonprinting
356 @opindex --show-nonprinting
357 Display control characters except for @key{LFD} and @key{TAB} using
358 @samp{^} notation and precede characters that have the high bit set
365 @section @code{tac}: Concatenate and write files in reverse
368 @cindex reversing files
370 @code{tac} copies each @var{file} (@samp{-} means standard input), or
371 standard input if none are given, to standard output, reversing the
372 records (lines by default) in each separately. Synopsis:
375 tac [@var{option}]@dots{} [@var{file}]@dots{}
378 @dfn{Records} are separated by instances of a string (newline by
379 default). By default, this separator string is attached to the end of
380 the record that it follows in the file.
382 The program accepts the following options. Also see @ref{Common options}.
390 The separator is attached to the beginning of the record that it
391 precedes in the file.
397 Treat the separator string as a regular expression.
399 @item -s @var{separator}
400 @itemx --separator=@var{separator}
403 Use @var{separator} as the record separator, instead of newline.
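
The effect of the separator options can be sketched as follows. The sample
input is made up, and a POSIX shell with the GNU implementation of
@code{tac} is assumed.

```shell
# Reverse newline-terminated records (the default).
lines=$(printf 'one\ntwo\nthree\n' | tac)
printf '%s\n' "$lines"

# Use "," as the separator; it stays attached to the end of each record.
fields=$(printf 'a,b,c,' | tac -s ,)
printf '%s\n' "$fields"

# With -b the separator is attached to the start of each record instead.
before=$(printf ',a,b,c' | tac -b -s ,)
printf '%s\n' "$before"
```

Because the separator travels with its record, the input should end with
the separator in the default mode, and begin with it when @samp{-b} is used.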
409 @section @code{nl}: Number lines and write files
412 @cindex numbering lines
413 @cindex line numbering
415 @code{nl} writes each @var{file} (@samp{-} means standard input), or
416 standard input if none are given, to standard output, with line numbers
417 added to some or all of the lines. Synopsis:
420 nl [@var{option}]@dots{} [@var{file}]@dots{}
423 @cindex logical pages, numbering on
424 @code{nl} decomposes its input into (logical) pages; by default, the
425 line number is reset to 1 at the top of each logical page. @code{nl}
426 treats all of the input files as a single document; it does not reset
427 line numbers or logical pages between files.
429 @cindex headers, numbering
430 @cindex body, numbering
431 @cindex footers, numbering
432 A logical page consists of three sections: header, body, and footer.
433 Any of the sections can be empty. Each can be numbered in a different
434 style from the others.
436 The beginnings of the sections of logical pages are indicated in the
437 input file by a line containing exactly one of these delimiter strings:
448 The two characters from which these strings are made can be changed from
449 @samp{\} and @samp{:} via options (see below), but the pattern and
450 length of each string cannot be changed.
452 A section delimiter is replaced by an empty line on output. Any text
453 that comes before the first section delimiter string in the input file
454 is considered to be part of a body section, so @code{nl} treats a
455 file that contains no section delimiters as a single body section.
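
A minimal sketch of logical pages, assuming a POSIX shell and GNU
@code{nl}: the made-up input below holds two logical pages, each opened by
a header delimiter (@samp{\:\:\:}) that is immediately followed by a body
delimiter (@samp{\:\:}). The numbering options used are described in the
list of options for this program.

```shell
# Two logical pages; the body lines are "alpha", "beta" and "gamma".
pages=$(printf '%s\n' '\:\:\:' '\:\:' 'alpha' 'beta' \
                      '\:\:\:' '\:\:' 'gamma' | nl -b a -w 1 -s ' ')
printf '%s\n' "$pages"

# Count the lines numbered 1: one per logical page, because the
# line number is reset at the top of each logical page.
resets=$(printf '%s\n' "$pages" | grep -c '^1 ')
printf '%s\n' "$resets"
```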
457 The program accepts the following options. Also see @ref{Common options}.
462 @itemx --body-numbering=@var{style}
464 @opindex --body-numbering
465 Select the numbering style for lines in the body section of each
466 logical page. When a line is not numbered, the current line number
467 is not incremented, but the line number separator character is still
468 prepended to the line. The styles are:
474 number only nonempty lines (default for body),
476 do not number lines (default for header and footer),
478 number only lines that contain a match for @var{regexp}.
482 @itemx --section-delimiter=@var{cd}
484 @opindex --section-delimiter
485 @cindex section delimiters of pages
486 Set the section delimiter characters to @var{cd}; default is
487 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
488 (Remember to protect @samp{\} or other metacharacters from shell
489 expansion with quotes or extra backslashes.)
492 @itemx --footer-numbering=@var{style}
494 @opindex --footer-numbering
495 Analogous to @samp{--body-numbering}.
498 @itemx --header-numbering=@var{style}
500 @opindex --header-numbering
501 Analogous to @samp{--body-numbering}.
503 @item -i @var{number}
504 @itemx --page-increment=@var{number}
506 @opindex --page-increment
507 Increment line numbers by @var{number} (default 1).
509 @item -l @var{number}
510 @itemx --join-blank-lines=@var{number}
512 @opindex --join-blank-lines
513 @cindex empty lines, numbering
514 @cindex blank lines, numbering
515 Consider @var{number} (default 1) consecutive empty lines to be one
516 logical line for numbering, and only number the last one. Where fewer
517 than @var{number} consecutive empty lines occur, do not number them.
518 An empty line is one that contains no characters, not even spaces
521 @item -n @var{format}
522 @itemx --number-format=@var{format}
524 @opindex --number-format
525 Select the line numbering format (default is @code{rn}):
529 @opindex ln @r{format for @code{nl}}
530 left justified, no leading zeros;
532 @opindex rn @r{format for @code{nl}}
533 right justified, no leading zeros;
535 @opindex rz @r{format for @code{nl}}
536 right justified, leading zeros.
542 @opindex --no-renumber
543 Do not reset the line number at the start of a logical page.
545 @item -s @var{string}
546 @itemx --number-separator=@var{string}
548 @opindex --number-separator
549 Separate the line number from the text line in the output with
550 @var{string} (default is @key{TAB}).
552 @item -v @var{number}
553 @itemx --starting-line-number=@var{number}
555 @opindex --starting-line-number
556 Set the initial line number on each logical page to @var{number} (default 1).
558 @item -w @var{number}
559 @itemx --number-width=@var{number}
561 @opindex --number-width
562 Use @var{number} characters for line numbers (default 6).
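
The numbering options can be combined, as in this sketch. It assumes a
POSIX shell and GNU @code{nl}, and the input lines are made up.

```shell
# Number all lines (-b a), start at 10 (-v), step by 5 (-i), and print
# zero-padded numbers three characters wide (-n rz, -w 3) followed by ": ".
numbered=$(printf 'a\nb\nc\n' | nl -b a -v 10 -i 5 -w 3 -n rz -s ': ')
printf '%s\n' "$numbered"

# Number only the lines that match a regular expression (-b pREGEXP),
# then count how many output lines start with a number.
matched=$(printf 'foo\nbar\nfoo2\n' | nl -b pfoo -w 1 -s ' ' | grep -c '^[0-9]')
printf '%s\n' "$matched"
```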
568 @section @code{od}: Write files in octal or other formats
571 @cindex octal dump of files
572 @cindex hex dump of files
573 @cindex ASCII dump of files
574 @cindex file contents, dumping unambiguously
576 @code{od} writes an unambiguous representation of each @var{file}
577 (@samp{-} means standard input), or standard input if none are given.
581 od [@var{option}]@dots{} [@var{file}]@dots{}
582 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
585 Each line of output consists of the offset in the input, followed by
586 groups of data from the file. By default, @code{od} prints the offset in
587 octal, and each group of file data is two bytes of input printed as a
590 The program accepts the following options. Also see @ref{Common options}.
595 @itemx --address-radix=@var{radix}
597 @opindex --address-radix
598 @cindex radix for file offsets
599 @cindex file offset radix
600 Select the base in which file offsets are printed. @var{radix} can
601 be one of the following:
611 none (do not print offsets).
614 The default is octal.
617 @itemx --skip-bytes=@var{bytes}
619 @opindex --skip-bytes
620 Skip @var{bytes} input bytes before formatting and writing. If
621 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
622 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
623 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
624 by 1024, and @samp{m} by 1048576.
627 @itemx --read-bytes=@var{bytes}
629 @opindex --read-bytes
630 Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
631 @var{bytes} are interpreted as for the @samp{-j} option.
634 @itemx --strings[=@var{n}]
637 @cindex string constants, outputting
638 Instead of the normal output, output only @dfn{string constants}: at
639 least @var{n} (3 by default) consecutive ASCII graphic characters,
640 followed by a null (zero) byte.
643 @itemx --format=@var{type}
646 Select the format in which to output the file data. @var{type} is a
647 string of one or more of the below type indicator characters. If you
648 include more than one type indicator character in a single @var{type}
649 string, or use this option more than once, @code{od} writes one copy
650 of each output line using each of the data types that you specified,
651 in the order that you specified.
653 Adding a trailing ``z'' to any type specification appends a display
654 of the ASCII character representation of the printable characters
655 to the output line generated by the type specification.
661 ASCII character or backslash escape,
674 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
675 newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
676 @samp{ }, @samp{\n}, and @samp{\0}, respectively.
679 Except for types @samp{a} and @samp{c}, you can specify the number
680 of bytes to use in interpreting each number in the given data type
681 by following the type indicator character with a decimal integer.
682 Alternatively, you can specify the size of one of the C compiler's
683 built-in data types by following the type indicator character with
684 one of the following characters. For integers (@samp{d}, @samp{o},
698 For floating point (@code{f}):
710 @itemx --output-duplicates
712 @opindex --output-duplicates
713 Output consecutive lines that are identical. By default, when two or
714 more consecutive output lines would be identical, @code{od} outputs only
715 the first line, and puts just an asterisk on the following line to
716 indicate the elision.
719 @itemx --width[=@var{n}]
722 Dump @var{n} input bytes per output line. This must be a multiple of
723 the least common multiple of the sizes associated with the specified
724 output types. If @var{n} is omitted, the default is 32. If this option
725 is not given at all, the default is 16.
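
The following sketch exercises several of the options above on made-up
input. It assumes a POSIX shell and GNU @code{od}; @code{tr} is used only
to normalize the spacing of the output before it is stored.

```shell
# Hexadecimal bytes with hexadecimal offsets.
printf 'ABCDEF' | od -A x -t x1

# No offsets (-A n); skip 2 bytes (-j) and dump at most 3 (-N).
slice=$(printf 'ABCDEF' | od -A n -j 2 -N 3 -t x1 | tr -d ' \n')
printf '%s\n' "$slice"

# Named characters (-t a) versus backslash escapes (-t c).
named=$(printf 'a\n' | od -A n -t a | tr -s ' ')
escaped=$(printf 'a\n' | od -A n -t c | tr -s ' ')
printf '%s\n' "$named" "$escaped"

# Two input bytes per output line (-w2): four bytes give two lines.
rows=$(printf 'ABCD' | od -A n -t x1 -w2 | wc -l)
printf '%s\n' "$rows"
```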
729 The next several options map the old, pre-@sc{POSIX} format specification
730 options to the corresponding @sc{POSIX} format specs. GNU @code{od} accepts
731 any combination of old- and new-style options. Format specification
738 Output as named characters. Equivalent to @samp{-ta}.
742 Output as octal bytes. Equivalent to @samp{-toC}.
746 Output as ASCII characters or backslash escapes. Equivalent to
751 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
755 Output as floats. Equivalent to @samp{-tfF}.
759 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
763 Output as decimal shorts. Equivalent to @samp{-td2}.
767 Output as decimal longs. Equivalent to @samp{-td4}.
771 Output as octal shorts. Equivalent to @samp{-to2}.
775 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
779 @opindex --traditional
780 Recognize the pre-@sc{POSIX} non-option arguments that traditional @code{od}
781 accepted. The following syntax:
784 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
788 can be used to specify at most one file and optional arguments
789 specifying an offset and a pseudo-start address, @var{label}. By
790 default, @var{offset} is interpreted as an octal number specifying how
791 many input bytes to skip before formatting and writing. The optional
792 trailing decimal point forces the interpretation of @var{offset} as a
793 decimal number. If no decimal point is specified and the offset begins with
794 @samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number. If
795 there is a trailing @samp{b}, the number of bytes skipped will be
796 @var{offset} multiplied by 512. The @var{label} argument is interpreted
797 just like @var{offset}, but it specifies an initial pseudo-address. The
798 pseudo-addresses are displayed in parentheses following any normal
804 @node Formatting file contents
805 @chapter Formatting file contents
807 @cindex formatting file contents
809 These commands reformat the contents of files.
812 * fmt invocation:: Reformat paragraph text.
813 * pr invocation:: Paginate or columnate files for printing.
814 * fold invocation:: Wrap input lines to fit in specified width.
819 @section @code{fmt}: Reformat paragraph text
822 @cindex reformatting paragraph text
823 @cindex paragraphs, reformatting
824 @cindex text, reformatting
826 @code{fmt} fills and joins lines to produce output lines of (at most)
827 a given number of characters (75 by default). Synopsis:
830 fmt [@var{option}]@dots{} [@var{file}]@dots{}
833 @code{fmt} reads from the specified @var{file} arguments (or standard
834 input if none are given), and writes to standard output.
836 By default, blank lines, spaces between words, and indentation are
837 preserved in the output; successive input lines with different
838 indentation are not joined; tabs are expanded on input and introduced on
841 @cindex line-breaking
842 @cindex sentences and line-breaking
843 @cindex Knuth, Donald E.
844 @cindex Plass, Michael F.
845 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
846 avoid line breaks after the first word of a sentence or before the last
847 word of a sentence. A @dfn{sentence break} is defined as either the end
848 of a paragraph or a word ending in any of @samp{.?!}, followed by two
849 spaces or end of line, ignoring any intervening parentheses or quotes.
850 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
851 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
852 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
853 and Experience}, 11 (1981), 1119--1184).
855 The program accepts the following options. Also see @ref{Common options}.
860 @itemx --crown-margin
862 @opindex --crown-margin
864 @dfn{Crown margin} mode: preserve the indentation of the first two
865 lines within a paragraph, and align the left margin of each subsequent
866 line with that of the second line.
869 @itemx --tagged-paragraph
871 @opindex --tagged-paragraph
872 @cindex tagged paragraphs
873 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
874 indentation of the first line of a paragraph is the same as the
875 indentation of the second, the first line is treated as a one-line
881 @opindex --split-only
882 Split lines only. Do not join short lines to form longer ones. This
883 prevents sample lines of code, and other such ``formatted'' text from
884 being unduly combined.
887 @itemx --uniform-spacing
889 @opindex --uniform-spacing
890 Uniform spacing. Reduce spacing between words to one space, and spacing
891 between sentences to two spaces.
894 @itemx -w @var{width}
895 @itemx --width=@var{width}
896 @opindex -@var{width}
899 Fill output lines up to @var{width} characters (default 75). @code{fmt}
900 initially tries to make lines about 7% shorter than this, to give it
901 room to balance line lengths.
903 @item -p @var{prefix}
904 @itemx --prefix=@var{prefix}
905 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
906 are subject to formatting. The prefix and any preceding whitespace are
907 stripped for the formatting and then re-attached to each formatted output
908 line. One use is to format certain kinds of program comments, while
909 leaving the code unchanged.
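
A small sketch of these options, assuming a POSIX shell and GNU
@code{fmt}. The sample text is made up. Because @code{fmt} chooses line
breaks by optimization, the sketch measures the result rather than
assuming where each break falls.

```shell
text='The quick brown fox jumps over the lazy dog and keeps running far away.'

# Refill to at most 30 characters per line (-w).
filled=$(printf '%s\n' "$text" | fmt -w 30)
printf '%s\n' "$filled"

# Length of the longest line produced; it should not exceed 30.
longest=$(printf '%s\n' "$filled" |
  awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }')
printf '%s\n' "$longest"

# Split-only mode (-s) never joins short lines.
kept=$(printf 'short\nlines\n' | fmt -s)
printf '%s\n' "$kept"

# Reformat only lines that begin with "#" (-p); the code line is untouched.
comments=$(printf '# first part\n# second part\nrun_me --now\n' | fmt -p '#')
printf '%s\n' "$comments"
```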
915 @section @code{pr}: Paginate or columnate files for printing
918 @cindex printing, preparing files for
919 @cindex multicolumn output, generating
920 @cindex merging files in parallel
922 @code{pr} writes each @var{file} (@samp{-} means standard input), or
923 standard input if none are given, to standard output, paginating and
924 optionally outputting in multicolumn format; optionally merges all
925 @var{file}s, printing all in parallel, one per column. Synopsis:
928 pr [@var{option}]@dots{} [@var{file}]@dots{}
931 By default, a 5-line header is printed: two blank lines; a line with the
932 date, the file name, and the page count; and two more blank lines. A
933 footer of five blank lines is also printed. With the @samp{-f} option, a
934 3-line header is printed: the leading two blank lines are omitted, and no
935 footer is used. The default @var{page_length} in both cases is 66 lines.
936 The text line of the header takes up the full @var{page_width} in the
937 form @samp{yy-mm-dd HH:MM string Page nnnn}. String is a centered
940 Form feeds in the input cause page breaks in the output. Multiple form
941 feeds produce empty pages.
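
The page structure described above can be observed with a short page
length. This is a sketch assuming a POSIX shell and GNU @code{pr}. The
input is made up, and the @samp{-l} and @samp{-t} options it relies on are
described in the list of options for this program.

```shell
# A 20-line page: 5 header lines, text, and 5 footer lines.
# Short input is padded with blank lines so the page is complete.
page=$(printf 'first\nsecond\n' | pr -l 20 | wc -l)
printf '%s\n' "$page"

# With -t the header, footer, and padding are all dropped.
bare=$(printf 'first\nsecond\n' | pr -t)
printf '%s\n' "$bare"

# A form feed in the input starts a new page, so two header lines
# (each containing the word "Page") appear in the output.
ff_pages=$(printf 'one\n\ftwo\n' | LC_ALL=C pr -l 20 | grep -c 'Page')
printf '%s\n' "$ff_pages"
```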
943 Columns have equal width, separated by an optional string (default
944 space). In multicolumn output, lines are always truncated to the line width (default 72),
945 unless you use the @samp{-j} option. For single-column output, no line
946 truncation occurs by default. Use the @samp{-w} option to truncate lines
949 The program accepts the following options. Also see @ref{Common options}.
953 @item +@var{first_page}[:@var{last_page}]
954 @itemx --pages=@var{first_page}[:@var{last_page}]
955 @opindex +@var{first_page}[:@var{last_page}]
957 Begin printing with page @var{first_page} and stop with
958 @var{last_page}. A missing @samp{:@var{last_page}} implies end of file. When
959 counting the pages to skip, each form feed in the input file
960 results in a new page. Page counting with and without
961 @samp{+@var{first_page}} is identical. By default, counting starts with the
962 first page of the input file (not the first page printed). Page numbering may be
963 altered by the @samp{-N} option.
966 @itemx --columns=@var{column}
967 @opindex -@var{column}
970 With each single @var{file}, produce @var{column}-column output and
971 print columns down. The column width is automatically estimated from
972 @var{page_width}. This option might well cause some columns to be
973 truncated. The number of lines in the columns on each page will be
974 balanced. @samp{-@var{column}} may not be used with the @samp{-m} option.
980 @cindex across columns
981 With each single @var{file}, print columns across rather than down.
982 @var{column} must be greater than one.
985 @itemx --show-control-chars
987 @opindex --show-control-chars
988 Print control characters using hat notation (e.g., @samp{^G}); print
989 other unprintable characters in octal backslash notation. By default,
990 unprintable characters are not changed.
993 @itemx --double-space
995 @opindex --double-space
996 @cindex double spacing
997 Double space the output.
999 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
1000 @itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
1002 @opindex --expand-tabs
1004 Expand tabs to spaces on input. Optional argument @var{in-tabchar} is
1005 the input tab character (default is @key{TAB}). Second optional
1006 argument @var{in-tabwidth} is the input tab character's width (default
1014 @opindex --form-feed
1015 Use a form feed instead of newlines to separate output pages. The default
1016 page length of 66 lines is not altered, but the number of lines of text
1017 per page changes from 56 to 63 lines.
1020 @item -h @var{header}
1021 @itemx --header=@var{header}
1024 Replace the file name in the header with the centered string
1025 @var{header}. Left-hand-side truncation (marked by a @samp{*}) may occur
1026 if the total header line @samp{yy-mm-dd HH:MM HEADER Page nnnn}
1027 becomes wider than @var{page_width}. @samp{-h ""} prints a blank
1028 header line. Don't use @samp{-h""}. A space between the @samp{-h} option and the
1029 argument is always required.
1031 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
1032 @itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
1034 @opindex --output-tabs
1036 Replace spaces with tabs on output. Optional argument @var{out-tabchar}
1037 is the output tab character (default is @key{TAB}). Second optional
1038 argument @var{out-tabwidth} is the output tab character's width (default
1044 @opindex --join-lines
1045 Merge lines of full length. Used together with the column options
1046 @samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}. Turns off
1047 @samp{-w} line truncation; no column alignment used; may be used with
1048 @samp{-s[@var{separator}]}.
1051 @item -l @var{page_length}
1052 @itemx --length=@var{page_length}
1055 Set the page length to @var{page_length} (default 66) lines. If
1056 @var{page_length} is less than or equal to 10 (or less than or equal to 3 with @samp{-f}),
1057 the headers and footers are omitted, and all form feeds set in input
1058 files are eliminated, as if the @samp{-T} option had been given.
1064 Merge and print all @var{file}s in parallel, one in each column. If a
1065 line is too long to fit in a column, it is truncated (but see
1066 @samp{-j}). @samp{-s[@var{separator}]} may be used. Empty pages in some
1067 @var{file}s (form feeds set) produce empty columns, still marked by
1068 @var{separator}. Completely empty common pages show no separators or
1069 line numbers. The default header becomes
1070 @samp{yy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
1071 @samp{-h @var{header}} to fill up the middle part.
1074 @item -n[@var{number-separator}[@var{digits}]]
1075 @itemx --number-lines[=@var{number-separator}[@var{digits}]]
1077 @opindex --number-lines
1078 Precede each column with a line number; with parallel @var{file}s
1079 (@samp{-m}), precede each line (rather than each column) with a line number. Optional argument
1080 @var{number-separator} is the character to print after each number
1081 (default is @key{TAB}). Optional argument @var{digits} is the number of
1082 digits per line number (default is 5). By default, line counting starts with the
1083 first line of the input file (not with the first line printed, see
1086 @item -N @var{line_number}
1087 @itemx --first-line-number=@var{line_number}
1089 @opindex --first-line-number
1090 Start line counting with no. @var{line_number} at first line of first
1094 @itemx --indent=@var{n}
1097 @cindex indenting lines
1099 Indent each line by @var{n} spaces (default is zero), i.e., set
1100 the left margin. The total page width is @var{n} plus the width set
1101 with the @samp{-w} option.
1104 @itemx --no-file-warnings
1106 @opindex --no-file-warnings
1107 Do not print a warning message when an argument @var{file} cannot be
1108 opened. (The exit status will still be nonzero, however.)
1110 @item -s[@var{separator}]
1111 @itemx --separator[=@var{separator}]
1113 @opindex --separator
1114 Separate columns by a string @var{separator}. Do not write
1115 @samp{-s @var{separator}}: no space is allowed between the flag and its argument. If this
1116 option is omitted altogether, the default is @key{TAB} when used with the
1117 @samp{-j} option and a space otherwise (same as @samp{-s" "}). With
1118 @samp{-s} only, no separator is used (same as @samp{-s""}). @samp{-s}
1119 does not affect line truncation or column alignment.
1122 @itemx --omit-header
1124 @opindex --omit-header
1125 Do not print the usual header [and footer] on each page, and do not fill
1126 out the bottoms of pages (with blank lines or a form feed). No page
1127 structure is produced, but form feeds set in the input files are retained. The
1128 predefined page layout is not changed. @samp{-t} or @samp{-T} may be
1129 useful together with other options; e.g., @samp{-t -e4} expands
1130 @key{TAB} in the input file to 4 spaces but makes no other changes.
1131 Use of @samp{-t} overrides @samp{-h}.
1134 @itemx --omit-pagination
1136 @opindex --omit-pagination
1137 Do not print header [and footer]. In addition eliminate all form feeds
1138 set in the input files.
1141 @itemx --show-nonprinting
1143 @opindex --show-nonprinting
1144 Print unprintable characters in octal backslash notation.
1146 @item -w @var{page_width}
1147 @itemx --width=@var{page_width}
1150 Set the page width to @var{page_width} (default 72) characters.
1151 With or without @samp{-w}, header lines are always truncated to
1152 @var{page_width} characters. With @samp{-w}, text lines are truncated,
1153 unless @samp{-j} is used. Without @samp{-w} together with one of the
1154 column options @samp{-@var{column}}, @samp{-a -@var{column}} or
1155 @samp{-m}, default truncation of text lines to 72 characters is used.
1156 Without @samp{-w} and without any of the column options, no line
1157 truncation is used. That's equivalent to @samp{-w 72 -j}.
1162 @node fold invocation
1163 @section @code{fold}: Wrap input lines to fit in specified width
1166 @cindex wrapping long input lines
1167 @cindex folding long input lines
1169 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1170 standard input if none are given, to standard output, breaking long
1174 fold [@var{option}]@dots{} [@var{file}]@dots{}
1177 By default, @code{fold} breaks lines wider than 80 columns. The output
1178 is split into as many lines as necessary.
1180 @cindex screen columns
1181 @code{fold} counts screen columns by default; thus, a tab may count more
1182 than one column, backspace decreases the column count, and carriage
1183 return sets the column to zero.
1185 The program accepts the following options. Also see @ref{Common options}.
1193 Count bytes rather than columns, so that tabs, backspaces, and carriage
1194 returns are each counted as taking up one column, just like other
1201 Break at word boundaries: the line is broken after the last blank before
1202 the maximum line length. If the line contains no such blanks, the line
1203 is broken at the maximum line length as usual.
1205 @item -w @var{width}
1206 @itemx --width=@var{width}
1209 Use a maximum line length of @var{width} columns instead of 80.
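The difference between breaking at the exact width and breaking at blanks can be seen in a small sketch. The sample sentence and the width of 15 are arbitrary choices for illustration, and the shell variable names are hypothetical.

```shell
# One long line of sample text (43 characters).
text='The quick brown fox jumps over the lazy dog'

# Default behaviour: break exactly at the width, even in the middle of a word.
hard=$(printf '%s\n' "$text" | fold -w 15)

# With -s: break after the last blank that still fits within the width.
soft=$(printf '%s\n' "$text" | fold -s -w 15)

printf '%s\n---\n%s\n' "$hard" "$soft"
```

With @samp{-s}, each output line except the last ends in the blank after which the break was made, so the output may contain trailing blanks.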
1214 @node Output of parts of files
1215 @chapter Output of parts of files
1217 @cindex output of parts of files
1218 @cindex parts of files, output of
1220 These commands output pieces of the input.
1223 * head invocation:: Output the first part of files.
1224 * tail invocation:: Output the last part of files.
1225 * split invocation:: Split a file into fixed-size pieces.
1226 * csplit invocation:: Split a file into context-determined pieces.
1229 @node head invocation
1230 @section @code{head}: Output the first part of files
1233 @cindex initial part of files, outputting
1234 @cindex first part of files, outputting
1236 @code{head} prints the first part (10 lines by default) of each
1237 @var{file}; it reads from standard input if no files are given or
1238 when given a @var{file} of @samp{-}. Synopses:
1241 head [@var{option}]@dots{} [@var{file}]@dots{}
1242 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1245 If more than one @var{file} is specified, @code{head} prints a
1246 one-line header consisting of
1248 ==> @var{file name} <==
1251 before the output for each @var{file}.
1253 @code{head} accepts two option formats: the new one, in which numbers
1254 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1255 the number precedes any option letters (@samp{-1q}).
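As an illustration of the new option format, the following sketch prints the start of two small files. The file names @file{a.txt} and @file{b.txt} and the scratch directory are made up for the example; only the new-style options are used, since support for the old format may vary between versions.

```shell
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$dir/a.txt"
printf 'b1\nb2\nb3\n' > "$dir/b.txt"

# First line of each file; with two files, head prints "==> name <==" headers.
with_headers=$(cd "$dir" && head -n 1 a.txt b.txt)

# -q suppresses the file name headers.
quiet=$(cd "$dir" && head -q -n 1 a.txt b.txt)

# -c counts bytes instead of lines.
first_bytes=$(head -c 2 "$dir/a.txt")

printf '%s\n\n%s\n\n%s\n' "$with_headers" "$quiet" "$first_bytes"
rm -rf "$dir"
```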
1257 The program accepts the following options. Also see @ref{Common options}.
1261 @item -@var{count}@var{options}
1262 @opindex -@var{count}
1263 This option is only recognized if it is specified first. @var{count} is
1264 a decimal number optionally followed by a size letter (@samp{b},
1265 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1266 or other option letters (@samp{cqv}).
1268 @item -c @var{bytes}
1269 @itemx --bytes=@var{bytes}
1272 Print the first @var{bytes} bytes, instead of initial lines. Appending
1273 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1277 @itemx --lines=@var{n}
1280 Output the first @var{n} lines.
1288 Never print file name headers.
1294 Always print file name headers.
1299 @node tail invocation
1300 @section @code{tail}: Output the last part of files
1303 @cindex last part of files, outputting
1305 @code{tail} prints the last part (10 lines by default) of each
1306 @var{file}; it reads from standard input if no files are given or
1307 when given a @var{file} of @samp{-}. Synopses:
1310 tail [@var{option}]@dots{} [@var{file}]@dots{}
1311 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1312 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1315 If more than one @var{file} is specified, @code{tail} prints a
1316 one-line header consisting of
1318 ==> @var{file name} <==
1321 before the output for each @var{file}.
1323 @cindex BSD @code{tail}
1324 GNU @code{tail} can output any amount of data (some other versions of
1325 @code{tail} cannot). It also has no @samp{-r} option (print in
1326 reverse), since reversing a file is really a different job from printing
1327 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1328 only reverse files that are at most as large as its buffer, which is
1329 typically 32k. A more reliable and versatile way to reverse files is
1330 the GNU @code{tac} command.
1332 @code{tail} accepts two option formats: the new one, in which numbers
1333 are arguments to the options (@samp{-n 1}), and the old one, in which
1334 the number precedes any option letters (@samp{-1} or @samp{+1}).
1336 If any option-argument is a number @var{n} starting with a @samp{+},
1337 @code{tail} begins printing with the @var{n}th item from the start of
1338 each file, instead of from the end.
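The effect of a leading @samp{+} can be sketched with five numbered input lines. The variable names are hypothetical, and only the new-style @samp{-n} form is shown.

```shell
# Five numbered lines as sample input.
input=$(printf '1\n2\n3\n4\n5\n')

# Without a sign prefix: the last two lines, counted from the end.
last_two=$(printf '%s\n' "$input" | tail -n 2)

# With a leading '+': start output at line 3, counted from the start.
from_third=$(printf '%s\n' "$input" | tail -n +3)

printf '%s\n---\n%s\n' "$last_two" "$from_third"
```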
1340 The program accepts the following options. Also see @ref{Common options}.
1346 @opindex -@var{count}
1347 @opindex +@var{count}
1348 This option is only recognized if it is specified first. @var{count} is
1349 a decimal number optionally followed by a size letter (@samp{b},
1350 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1351 or other option letters (@samp{cfqv}).
1353 @item -c @var{bytes}
1354 @itemx --bytes=@var{bytes}
1357 Output the last @var{bytes} bytes, instead of final lines. Appending
1358 @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
1365 @cindex growing files
1366 Loop forever trying to read more characters at the end of the file,
1367 presumably because the file is growing. Ignored if reading from a pipe.
1368 If more than one file is given, @code{tail} prints a header whenever it
1369 gets output from a different file, to indicate which file that output is
1373 @itemx --lines=@var{n}
1376 Output the last @var{n} lines.
1384 Never print file name headers.
1390 Always print file name headers.
1395 @node split invocation
1396 @section @code{split}: Split a file into fixed-size pieces
1399 @cindex splitting a file into pieces
1400 @cindex pieces, splitting a file into
1402 @code{split} creates output files containing consecutive sections of
1403 @var{input} (standard input if none is given or @var{input} is
1404 @samp{-}). Synopsis:
1407 split [@var{option}] [@var{input} [@var{prefix}]]
1410 By default, @code{split} puts 1000 lines of @var{input} (or whatever is
1411 left over for the last section) into each output file.
1413 @cindex output file name prefix
1414 The output files' names consist of @var{prefix} (@samp{x} by default)
1415 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1416 that concatenating the output files in sorted order by file name produces
1417 the original input file. (If more than 676 output files are required,
1418 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
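The naming scheme and the round-trip property can be checked with a short sketch. The prefix @samp{part.} and the piece size of four lines are arbitrary choices for the example.

```shell
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$dir/input"

# Split ten lines into pieces of at most four lines each.
(cd "$dir" && split -l 4 input part.)

# The shell expands the glob in sorted order: part.aa part.ab part.ac
pieces=$(cd "$dir" && echo part.*)

# Concatenating the pieces in sorted name order reproduces the input.
roundtrip=differs
(cd "$dir" && cat part.* | cmp -s - input) && roundtrip=identical

printf '%s\n%s\n' "$pieces" "$roundtrip"
rm -rf "$dir"
```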
1420 The program accepts the following options. Also see @ref{Common options}.
1425 @itemx -l @var{lines}
1426 @itemx --lines=@var{lines}
1429 Put @var{lines} lines of @var{input} into each output file.
1431 @item -b @var{bytes}
1432 @itemx --bytes=@var{bytes}
1435 Put @var{bytes} bytes of @var{input} into each output file.
1436 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1437 @samp{m} by 1048576.
1439 @item -C @var{bytes}
1440 @itemx --line-bytes=@var{bytes}
1442 @opindex --line-bytes
1443 Put into each output file as many complete lines of @var{input} as
1444 possible without exceeding @var{bytes} bytes. For lines longer than
1445 @var{bytes} bytes, put @var{bytes} bytes into each output file until
1446 fewer than @var{bytes} bytes of the line are left, then continue
1447 normally. @var{bytes} has the same format as for the @samp{--bytes}
1452 Write a diagnostic to standard error just before each output file is opened.
1457 @node csplit invocation
1458 @section @code{csplit}: Split a file into context-determined pieces
1461 @cindex context splitting
1462 @cindex splitting a file into pieces by context
1464 @code{csplit} creates zero or more output files containing sections of
1465 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1468 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1471 The contents of the output files are determined by the @var{pattern}
1472 arguments, as detailed below. An error occurs if a @var{pattern}
1473 argument refers to a nonexistent line of the input file (e.g., if no
1474 remaining line matches a given regular expression). After every
1475 @var{pattern} has been matched, any remaining input is copied into one
1478 By default, @code{csplit} prints the number of bytes written to each
1479 output file after it has been created.
1481 The types of pattern arguments are:
1486 Create an output file containing the input up to but not including line
1487 @var{n} (a positive integer). If followed by a repeat count, also
1488 create an output file containing the next @var{n} lines of the input
1489 file once for each repeat.
1491 @item /@var{regexp}/[@var{offset}]
1492 Create an output file containing the current line up to (but not
1493 including) the next line of the input file that contains a match for
1494 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1495 followed by a positive integer. If it is given, the input up to the
1496 matching line plus or minus @var{offset} is put into the output file,
1497 and the line after that begins the next section of input.
1499 @item %@var{regexp}%[@var{offset}]
1500 Like the previous type, except that it does not create an output
1501 file, so that section of the input file is effectively ignored.
1503 @item @{@var{repeat-count}@}
1504 Repeat the previous pattern @var{repeat-count} additional
1505 times. @var{repeat-count} can either be a positive integer or an
1506 asterisk, meaning repeat as many times as necessary until the input is
1511 The output files' names consist of a prefix (@samp{xx} by default)
1512 followed by a suffix. By default, the suffix is an ascending sequence
1513 of two-digit decimal numbers from @samp{00} up to @samp{99}. In any
1514 case, concatenating the output files in sorted order by filename
1515 produces the original input file.
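A regular-expression pattern combined with a repeat count might be used as in the following sketch. The input text and the @samp{#} section marker are invented for the example, and it assumes a @code{csplit} that accepts the asterisk repeat count described above.

```shell
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d)
printf 'intro\n# one\nalpha\n# two\nbeta\n' > "$dir/input"

# Start a new output file before every line that begins with '#'.
# '{*}' repeats the pattern until the input is exhausted;
# -s suppresses the byte counts normally printed for each output file.
(cd "$dir" && csplit -s input '/^#/' '{*}')

files=$(cd "$dir" && echo xx*)
second=$(cat "$dir/xx01")

printf '%s\n%s\n' "$files" "$second"
rm -rf "$dir"
```

Here the first piece holds the text before the first marker, and each later piece begins with its marker line.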
1517 By default, if @code{csplit} encounters an error or receives a hangup,
1518 interrupt, quit, or terminate signal, it removes any output files
1519 that it has created so far before it exits.
1521 The program accepts the following options. Also see @ref{Common options}.
1525 @item -f @var{prefix}
1526 @itemx --prefix=@var{prefix}
1529 @cindex output file name prefix
1530 Use @var{prefix} as the output file name prefix.
1532 @item -b @var{suffix}
1533 @itemx --suffix=@var{suffix}
1536 @cindex output file name suffix
1537 Use @var{suffix} as the output file name suffix. When this option is
1538 specified, the suffix string must include exactly one
1539 @code{printf(3)}-style conversion specification, possibly including
1540 format specification flags, a field width, a precision specification,
1541 or all of these kinds of modifiers. The format letter must convert a
1542 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1543 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1544 entire @var{suffix} is given (with the current output file number) to
1545 @code{sprintf(3)} to form the file name suffixes for each of the
1546 individual output files in turn. If this option is used, the
1547 @samp{--digits} option is ignored.
1549 @item -n @var{digits}
1550 @itemx --digits=@var{digits}
1553 Use output file names containing numbers that are @var{digits} digits
1554 long instead of the default 2.
1559 @opindex --keep-files
1560 Do not remove output files when errors are encountered.
1563 @itemx --elide-empty-files
1565 @opindex --elide-empty-files
1566 Suppress the generation of zero-length output files. (In cases where
1567 the section delimiters of the input file are supposed to mark the first
1568 lines of each of the sections, the first output file will generally be a
1569 zero-length file unless you use this option.) The output file sequence
1570 numbers always run consecutively starting from 0, even when this option
1581 Do not print counts of output file sizes.
1586 @node Summarizing files
1587 @chapter Summarizing files
1589 @cindex summarizing files
1591 These commands generate just a few numbers representing entire
1595 * wc invocation:: Print byte, word, and line counts.
1596 * sum invocation:: Print checksum and block counts.
1597 * cksum invocation:: Print CRC checksum and byte counts.
1598 * md5sum invocation:: Print or check message-digests.
1603 @section @code{wc}: Print byte, word, and line counts
1610 @code{wc} counts the number of bytes, whitespace-separated words, and
1611 newlines in each given @var{file}, or standard input if none are given
1612 or for a @var{file} of @samp{-}. Synopsis:
1615 wc [@var{option}]@dots{} [@var{file}]@dots{}
1618 @cindex total counts
1619 @code{wc} prints one line of counts for each file, and if the file was
1620 given as an argument, it prints the file name following the counts. If
1621 more than one @var{file} is given, @code{wc} prints a final line
1622 containing the cumulative counts, with the file name @file{total}. The
1623 counts are printed in this order: newlines, words, bytes.
1625 By default, @code{wc} prints all three counts. Options can specify
1626 that only certain counts be printed. Options do not undo others
1627 previously given, so
1634 prints both the byte counts and the word counts.
1636 With the @code{--max-line-length} option, @code{wc} prints the length
1637 of the longest line per file, and if there is more than one file it
1638 prints the maximum (not the sum) of those lengths.
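The individual counts and the way options accumulate can be sketched as follows. The helper function @code{sample} is hypothetical, and the surrounding blanks that @code{wc} prints are stripped because the padding may differ between versions.

```shell
# Sample input: two lines, three words, 14 bytes.
sample() { printf 'one two\nthree\n'; }

lines=$(sample | wc -l | tr -d ' ')
words=$(sample | wc -w | tr -d ' ')
bytes=$(sample | wc -c | tr -d ' ')

# Options accumulate: -c and -w together print the word count, then the
# byte count.  The unquoted expansion collapses the padding blanks.
both=$(echo $(sample | wc -c -w))

# Length of the longest line ("one two" is 7 characters long).
longest=$(sample | wc -L | tr -d ' ')

printf '%s %s %s | %s | %s\n' "$lines" "$words" "$bytes" "$both" "$longest"
```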
1640 The program accepts the following options. Also see @ref{Common options}.
1650 Print only the byte counts.
1656 Print only the word counts.
1662 Print only the newline counts.
1665 @itemx --max-line-length
1667 @opindex --max-line-length
1668 Print only the maximum line lengths.
1673 @node sum invocation
1674 @section @code{sum}: Print checksum and block counts
1677 @cindex 16-bit checksum
1678 @cindex checksum, 16-bit
1680 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1681 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1684 sum [@var{option}]@dots{} [@var{file}]@dots{}
1687 @code{sum} prints the checksum for each @var{file} followed by the
1688 number of blocks in the file (rounded up). If more than one @var{file}
1689 is given, file names are also printed (by default). (With the
1690 @samp{--sysv} option, the corresponding file names are printed when there is
1691 at least one file argument.)
1693 By default, GNU @code{sum} computes checksums using an algorithm
1694 compatible with BSD @code{sum} and prints file sizes in units of
1697 The program accepts the following options. Also see @ref{Common options}.
1703 @cindex BSD @code{sum}
1704 Use the default (BSD compatible) algorithm. This option is included for
1705 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1706 given, it has no effect.
1712 @cindex System V @code{sum}
1713 Compute checksums using an algorithm compatible with System V
1714 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1718 @code{sum} is provided for compatibility; the @code{cksum} program (see
1719 next section) is preferable in new applications.
1722 @node cksum invocation
1723 @section @code{cksum}: Print CRC checksum and byte counts
1726 @cindex cyclic redundancy check
1727 @cindex CRC checksum
1729 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1730 given @var{file}, or standard input if none are given or for a
1731 @var{file} of @samp{-}. Synopsis:
1734 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1737 @code{cksum} prints the CRC checksum for each file along with the number
1738 of bytes in the file, and the filename unless no arguments were given.
1740 @code{cksum} is typically used to ensure that files
1741 transferred by unreliable means (e.g., netnews) have not been corrupted,
1742 by comparing the @code{cksum} output for the received files with the
1743 @code{cksum} output for the original files (typically given in the
1746 The CRC algorithm is specified by the @sc{POSIX.2} standard. It is not
1747 compatible with the BSD or System V @code{sum} algorithms (see the
1748 previous section); it is more robust.
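The comparison described above might look like the following sketch, in which a local copy stands in for a transferred file. The file names are hypothetical; reading from standard input keeps the file name out of the output so that the two results can be compared directly.

```shell
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/original"
cp "$dir/original" "$dir/received"

# With no file argument, cksum prints only the checksum and the byte count.
sum_a=$(cksum < "$dir/original")
sum_b=$(cksum < "$dir/received")

# The second field is the number of bytes read.
size=$(printf '%s\n' "$sum_a" | cut -d ' ' -f 2)

if [ "$sum_a" = "$sum_b" ]; then verdict=intact; else verdict=corrupted; fi
printf '%s (%s bytes)\n' "$verdict" "$size"
rm -rf "$dir"
```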
1750 The only options are @samp{--help} and @samp{--version}. @xref{Common
1754 @node md5sum invocation
1755 @section @code{md5sum}: Print or check message-digests
1758 @cindex 128-bit checksum
1759 @cindex checksum, 128-bit
1760 @cindex fingerprint, 128-bit
1761 @cindex message-digest, 128-bit
1763 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1764 @dfn{message-digest}) for each specified @var{file}.
1765 If a @var{file} is specified as @samp{-} or if no files are given,
1766 @code{md5sum} computes the checksum for the standard input.
1767 @code{md5sum} can also determine whether a file and checksum are
1768 consistent. Synopses:
1771 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1772 md5sum [@var{option}]@dots{} --check [@var{file}]
1775 For each @var{file}, @samp{md5sum} outputs the MD5 checksum, a flag
1776 indicating a binary or text input file, and the filename.
1777 If @var{file} is omitted or specified as @samp{-}, standard input is read.
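A typical round trip, generating checksums and later verifying them with @samp{--check}, can be sketched as follows. The file names @file{data.txt} and @file{sums.md5} are made up for the example, and only the exit status is inspected.

```shell
# Work in a scratch directory so that no existing files are touched.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/data.txt"

# Record the checksum line (checksum, binary/text flag, file name).
(cd "$dir" && md5sum data.txt > sums.md5)

# Verification succeeds while the file is unchanged.
if (cd "$dir" && md5sum --check sums.md5 > /dev/null 2>&1); then
  unchanged=pass
else
  unchanged=fail
fi

# After the file is modified, verification exits with nonzero status.
printf 'tampered\n' > "$dir/data.txt"
if (cd "$dir" && md5sum --check sums.md5 > /dev/null 2>&1); then
  modified=pass
else
  modified=fail
fi

printf 'unchanged: %s, modified: %s\n' "$unchanged" "$modified"
rm -rf "$dir"
```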
1779 The program accepts the following options. Also see @ref{Common options}.
1787 @cindex binary input files
1788 Treat all input files as binary. This option has no effect on Unix
1789 systems, since they don't distinguish between binary and text files.
1790 This option is useful on systems that have different internal and
1791 external character representations.
1795 Read filenames and checksum information from the single @var{file}
1796 (or from stdin if no @var{file} was specified) and report whether
1797 each named file and the corresponding checksum data are consistent.
1798 The input to this mode of @code{md5sum} is usually the output of
1799 a prior, checksum-generating run of @samp{md5sum}.
1800 Each valid line of input consists of an MD5 checksum, a binary/text
1801 flag, and then a filename.
1802 Binary files are marked with @samp{*}, text with @samp{ }.
1803 For each such line, @code{md5sum} reads the named file and computes its
1804 MD5 checksum. Then, if the computed message digest does not match the
1805 one on the line with the filename, the file is noted as having
1806 failed the test. Otherwise, the file passes the test.
1807 By default, for each valid line, one line is written to standard
1808 output indicating whether the named file passed the test.
1809 After all checks have been performed, if there were any failures,
1810 a warning is issued to standard error.
1811 Use the @samp{--status} option to inhibit that output.
1812 If any listed file cannot be opened or read, if any valid line has
1813 an MD5 checksum inconsistent with the associated file, or if no valid
1814 line is found, @code{md5sum} exits with nonzero status. Otherwise,
1815 it exits successfully.
1819 @cindex verifying MD5 checksums
1820 This option is useful only when verifying checksums.
1821 When verifying checksums, don't generate the default one-line-per-file
1822 diagnostic and don't output the warning summarizing any failures.
1823 Failures to open or read a file still evoke individual diagnostics to
1825 If all listed files are readable and are consistent with the associated
1826 MD5 checksums, exit successfully. Otherwise exit with a status code
1827 indicating there was a failure.
1833 @cindex text input files
1834 Treat all input files as text files. This is the reverse of
1841 @cindex verifying MD5 checksums
1842 When verifying checksums, warn about improperly formatted MD5 checksum lines.
1843 This option is useful only if all but a few lines in the checked input
1849 @node Operating on sorted files
1850 @chapter Operating on sorted files
1852 @cindex operating on sorted files
1853 @cindex sorted files, operations on
1855 These commands work with (or produce) sorted files.
1858 * sort invocation:: Sort text files.
1859 * uniq invocation:: Uniqify files.
1860 * comm invocation:: Compare two sorted files line by line.
1865 @node sort invocation
1866 @section @code{sort}: Sort text files
1869 @cindex sorting files
1871 @code{sort} sorts, merges, or compares all the lines from the given
1872 files, or standard input if none are given or for a @var{file} of
1873 @samp{-}. By default, @code{sort} writes the results to standard
1877 sort [@var{option}]@dots{} [@var{file}]@dots{}
1880 @code{sort} has three modes of operation: sort (the default), merge,
1881 and check for sortedness. The following options change the operation
1888 @cindex checking for sortedness
1889 Check whether the given files are already sorted: if they are not all
1890 sorted, print an error message and exit with a status of 1.
1891 Otherwise, exit successfully.
1895 @cindex merging sorted files
1896 Merge the given files by sorting them as a group. Each input file must
1897 always be individually sorted. It always works to sort instead of
1898 merge; merging is provided because it is faster, in the case where it
1903 A pair of lines is compared as follows: if any key fields have been
1904 specified, @code{sort} compares each pair of fields, in the order
1905 specified on the command line, according to the associated ordering
1906 options, until a difference is found or no fields are left.
1908 If any of the global options @samp{Mbdfinr} are given but no key fields
1909 are specified, @code{sort} compares the entire lines according to the
1912 Finally, as a last resort when all keys compare equal (or if no
1913 ordering options were specified at all), @code{sort} compares the lines
1914 byte by byte in machine collating sequence. The last resort comparison
1915 honors the @samp{-r} global option. The @samp{-s} (stable) option
1916 disables this last-resort comparison so that lines in which all fields
1917 compare equal are left in their original relative order. If no fields
1918 or global options are specified, @samp{-s} has no effect.
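The effect of the last-resort comparison, and of disabling it with @samp{-s}, can be seen in a small sketch. The three input lines are invented, and @code{LC_ALL=C} is set so that the byte-by-byte comparison does not depend on the locale.

```shell
# Two lines share the key "a" in field one; they arrive as "a 2" then "a 1".
input=$(printf 'b 1\na 2\na 1\n')

# Default: lines with equal keys are ordered by the last-resort
# byte-by-byte comparison of the whole line.
default_order=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1,1)

# With -s: lines with equal keys keep their original relative order.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf '%s\n---\n%s\n' "$default_order" "$stable_order"
```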
1920 GNU @code{sort} (as specified for all GNU utilities) has no limits on
1921 input line length or restrictions on bytes allowed within lines. In
1922 addition, if the final byte of an input file is not a newline, GNU
1923 @code{sort} silently supplies one.
1925 Upon any error, @code{sort} exits with a status of @samp{2}.
1928 If the environment variable @code{TMPDIR} is set, @code{sort} uses its
1929 value as the directory for temporary files instead of @file{/tmp}. The
1930 @samp{-T @var{tempdir}} option in turn overrides the environment
1933 The following options affect the ordering of output lines. They may be
1934 specified globally or as part of a specific key field. If no key
1935 fields are specified, global options apply to comparison of entire
1936 lines; otherwise the global options are inherited by key fields that do
1937 not specify any special options of their own.
1943 @cindex blanks, ignoring leading
1944 Ignore leading blanks when finding sort keys in each line.
1948 @cindex phone directory order
1949 @cindex telephone directory order
1950 Sort in @dfn{phone directory} order: ignore all characters except
1951 letters, digits and blanks when sorting.
1955 @cindex case folding
1956 Fold lowercase characters into the equivalent uppercase characters when
1957 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
1961 @cindex general numeric sort
1962 Sort numerically, but use @code{strtod(3)} to arrive at the numeric values.
1963 This allows floating point numbers to be specified in scientific notation,
1964 like @code{1.0e-34} and @code{10e100}. Use this option only if there
1965 is no alternative; it is much slower than @samp{-n} and numbers with
1966 too many significant digits will be compared as if they had been
1967 truncated. In addition, numbers outside the range of representable
1968 double precision floating point numbers are treated as if they were
1969 zeroes; overflow and underflow are not reported.
1973 @cindex unprintable characters, ignoring
1974 Ignore characters outside the printable ASCII range 040-0176 octal
1975 (inclusive) when sorting.
1979 @cindex months, sorting by
1980 An initial string, consisting of any amount of whitespace, followed
1981 by three letters abbreviating a month name, is folded to upper case and
1982 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
1983 Invalid names compare low to valid names.
1987 @cindex numeric sort
1988 Sort numerically: the number begins each line; specifically, it consists
1989 of optional whitespace, an optional @samp{-} sign, and zero or more
1990 digits, optionally followed by a decimal point and zero or more digits.
1992 @code{sort -n} uses what might be considered an unconventional method
1993 to compare strings representing floating point numbers. Rather than
1994 first converting each string to the C @code{double} type and then
1995 comparing those values, @code{sort} aligns the decimal points in the two
1996 strings and compares the strings a character at a time. One benefit
1997 of using this approach is its speed. In practice this is much more
1998 efficient than performing the two corresponding string-to-double (or even
1999 string-to-integer) conversions and then comparing doubles. In addition,
2000 there is no corresponding loss of precision. Converting each string to
2001 @code{double} before comparison would limit precision to about 16 digits
2004 Neither a leading @samp{+} nor exponential notation is recognized.
2005 To compare such strings numerically, use the @samp{-g} option.
2009 @cindex reverse sorting
2010 Reverse the result of comparison, so that lines with greater key values
2011 appear earlier in the output instead of later.
2019 @item -o @var{output-file}
2021 @cindex overwriting of input, allowed
2022 Write output to @var{output-file} instead of standard output.
2023 If @var{output-file} is one of the input files, @code{sort} copies
2024 it to a temporary file before sorting and writing the output to
2027 @item -t @var{separator}
2029 @cindex field separator character
2030 Use character @var{separator} as the field separator when finding the
2031 sort keys in each line. By default, fields are separated by the empty
2032 string between a non-whitespace character and a whitespace character.
2033 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
2034 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
2035 not considered to be part of either the field preceding or the field
2040 @cindex uniqifying output
2041 For the default case or the @samp{-m} option, only output the first
2042 of a sequence of lines that compare equal. For the @samp{-c} option,
2043 check that no pair of consecutive lines compares equal.
2045 @item -k @var{pos1}[,@var{pos2}]
2048 The recommended, @sc{POSIX}, option for specifying a sort field. The field
2049 consists of the line between @var{pos1} and @var{pos2} (or the end of
2050 the line, if @var{pos2} is omitted), inclusive. Fields and character
2051 positions are numbered starting with 1. See below.
2055 @cindex sort zero-terminated lines
2056 Treat the input as a set of lines, each terminated by a zero byte (@sc{ASCII}
2057 @sc{NUL} (Null) character) instead of an @sc{ASCII} @sc{LF} (Line Feed).
2058 This option can be useful in conjunction with @samp{perl -0} or
2059 @samp{find -print0} and @samp{xargs -0} which do the same in order to
2060 reliably handle arbitrary pathnames (even those which contain Line Feed
2063 @item +@var{pos1}[-@var{pos2}]
2064 The obsolete, traditional option for specifying a sort field. The field
2065 consists of the line between @var{pos1} and up to but @emph{not including}
2066 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
2067 and character positions are numbered starting with 0. See below.
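The difference between the @samp{-n} and @samp{-g} ordering options described above can be sketched with one value written in exponential notation. The three sample values are arbitrary, and @code{LC_ALL=C} is set only to keep the result independent of the locale.

```shell
# Sample values; "1e3" means 1000 only if exponents are understood.
values=$(printf '200\n1e3\n5\n')

# -n does not recognize exponential notation: "1e3" is read as 1.
numeric=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

# -g converts each value as a floating point number: "1e3" is 1000.
general=$(printf '%s\n' "$values" | LC_ALL=C sort -g)

printf '%s\n---\n%s\n' "$numeric" "$general"
```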
2071 In addition, when GNU @code{sort} is invoked with exactly one argument,
2072 options @samp{--help} and @samp{--version} are recognized. @xref{Common
2075 Historical (BSD and System V) implementations of @code{sort} have
2076 differed in their interpretation of some options, particularly
2077 @samp{-b}, @samp{-f}, and @samp{-n}. GNU @code{sort} follows the @sc{POSIX}
2078 behavior, which is usually (but not always!) like the System V behavior.
2079 According to @sc{POSIX}, @samp{-n} no longer implies @samp{-b}. For
2080 consistency, @samp{-M} has been changed in the same way. This may
2081 affect the meaning of character positions in field specifications in
2082 obscure cases. The only fix is to add an explicit @samp{-b}.
2084 A position in a sort field specified with the @samp{-k} or @samp{+}
2085 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
2086 of the field to use and @var{c} is the number of the first character
2087 from the beginning of the field (for @samp{+@var{pos}}) or from the end
2088 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
2089 is omitted, it is taken to be the first character in the field. If the
2090 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
2091 specification is counted from the first nonblank character of the field
2092 (for @samp{+@var{pos}}) or from the first nonblank character following
2093 the previous field (for @samp{-@var{pos}}).
2095 A sort key option may also have any of the option letters @samp{Mbdfinr}
2096 appended to it, in which case the global ordering options are not used
2097 for that particular field. The @samp{-b} option may be independently
2098 attached to either or both of the @samp{+@var{pos}} and
2099 @samp{-@var{pos}} parts of a field specification, and if it is inherited
2100 from the global options it will be attached to both.
2101 Keys may span multiple fields.
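As a rough illustration of the @samp{@var{f}.@var{c}} notation with the @sc{POSIX} @samp{-k} option, the following sketch sorts a few made-up three-character records on a single character position. The sample data and the shell variable names are assumptions chosen for the example, and @samp{LC_ALL=C} is set only to make the ordering predictable.

```shell
# Three made-up records; each is a single field of three characters.
records=$(printf 'xb1\nxa2\nxc0\n')

# Key is character 2 of field 1 only (-k counts fields and characters from 1).
by_second_char=$(printf '%s\n' "$records" | LC_ALL=C sort -k 1.2,1.2)

# Key is character 3 of field 1, compared numerically via the n modifier.
by_third_char=$(printf '%s\n' "$records" | LC_ALL=C sort -k 1.3,1.3n)

printf '%s\n' "$by_second_char"
printf '%s\n' "$by_third_char"
```

Ending each key at the same position where it starts keeps the comparison from running on to the end of the line.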
2103 Here are some examples to illustrate various combinations of options.
2104 In them, the @sc{POSIX} @samp{-k} option is used to specify sort keys rather
2105 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
2110 Sort in descending (reverse) numeric order.
2116 Sort alphabetically, omitting the first and second fields.
2117 This uses a single key composed of the characters beginning
2118 at the start of field three and extending to the end of each line.
2125 Sort numerically on the second field and resolve ties by sorting
2126 alphabetically on the third and fourth characters of field five.
2127 Use @samp{:} as the field delimiter.
2130 sort -t : -k 2,2n -k 5.3,5.4
2133 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2}
2134 @samp{sort} would have used all characters beginning in the second field
2135 and extending to the end of the line as the primary @emph{numeric}
2136 key. For the large majority of applications, treating keys spanning
2137 more than one field as numeric will not do what you expect.
2139 Also note that the @samp{n} modifier was applied to the field-end
2140 specifier for the first key. It would have been equivalent to
2141 specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
2142 @samp{b} apply to the associated @emph{field}, regardless of whether
2143 the modifier character is attached to the field-start and/or the
2144 field-end part of the key specifier.
2147 Sort the password file on the fifth field and ignore any
2148 leading white space. Sort lines with equal values in field five
2149 on the numeric user ID in field three.
2152 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2155 An alternative is to use the global numeric modifier @samp{-n}.
2158 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
2162 Generate a tags file in case-insensitive sorted order.
2164 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
2167 The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
2168 that pathnames that contain Line Feed characters will not get broken up
2169 by the sort operation.
2171 Finally, to ignore both leading and trailing white space, you
2172 could have applied the @samp{b} modifier to the field-end specifier for the first key,
2176 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
2179 or you could have used the global @samp{-b} modifier instead of @samp{-n}
2180 and an explicit @samp{n} with the second key specifier.
2183 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
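The two-key example given earlier (@samp{-k 2,2n -k 5.3,5.4}) can be tried on a small sample. This is only a sketch: the colon-separated records below are invented for the purpose, and @samp{LC_ALL=C} is used just to keep the character ordering predictable.

```shell
# Invented records: field 2 is numeric, field 5 carries the tie-breaking text.
data=$(printf 'a:10:x:y:zzbb\nb:2:x:y:zzaa\nc:10:x:y:zzaa\n')

# Primary key: field 2, numeric.  Secondary key: characters 3-4 of field 5.
sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -t : -k 2,2n -k 5.3,5.4)

printf '%s\n' "$sorted"
```

The record with 2 in field two comes first; the two records with 10 are then ordered by @samp{aa} before @samp{bb}.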
2189 @node uniq invocation
2190 @section @code{uniq}: Uniqify files
2193 @cindex uniqify files
2195 @code{uniq} writes the unique lines in the given @var{input}, or
2196 standard input if nothing is given or for an @var{input} name of @samp{-}. Synopsis:
2200 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2203 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2204 discards all but one of identical successive lines. Optionally, it can
2205 instead show only lines that appear exactly once, or lines that appear more than once.
2208 The input must be sorted. If your input is not sorted, perhaps you want
2209 to use @code{sort -u}.
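The following sketch shows why sorted input matters: @code{uniq} only compares successive lines. The sample lines and variable names are made up for the illustration; @samp{-d} and @samp{-u} are the options, described below, that print only duplicate lines and only unique lines.

```shell
# Made-up input in which equal lines are not adjacent.
unsorted=$(printf 'b\na\nc\nb\na\na\n')

# Without sorting, only successive repeats are collapsed.
adjacent_only=$(printf '%s\n' "$unsorted" | uniq)

# Sorting first brings equal lines together.
deduplicated=$(printf '%s\n' "$unsorted" | LC_ALL=C sort | uniq)
same_via_sort=$(printf '%s\n' "$unsorted" | LC_ALL=C sort -u)

# Lines that occur more than once, and lines that occur exactly once.
repeated=$(printf '%s\n' "$unsorted" | LC_ALL=C sort | uniq -d)
singletons=$(printf '%s\n' "$unsorted" | LC_ALL=C sort | uniq -u)
```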
2211 If no @var{output} file is specified, @code{uniq} writes to standard output.
2214 The program accepts the following options. Also see @ref{Common options}.
2220 @itemx --skip-fields=@var{n}
2223 @opindex --skip-fields
2224 Skip @var{n} fields on each line before checking for uniqueness. Fields
2225 are sequences of non-space non-tab characters that are separated from
2226 each other by one or more spaces or tabs.
2230 @itemx --skip-chars=@var{n}
2233 @opindex --skip-chars
2234 Skip @var{n} characters before checking for uniqueness. If you use both
2235 the field and character skipping options, fields are skipped over first.
2241 Print the number of times each line occurred along with the line.
2244 @itemx --ignore-case
2246 @opindex --ignore-case
2247 Ignore differences in case when comparing lines.
2253 @cindex duplicate lines, outputting
2254 Print only duplicate lines.
2260 @cindex unique lines, outputting
2261 Print only unique lines.
2264 @itemx --check-chars=@var{n}
2266 @opindex --check-chars
2267 Compare @var{n} characters on each line (after skipping any specified
2268 fields and characters). By default the entire rest of the lines are compared.
2274 @node comm invocation
2275 @section @code{comm}: Compare two sorted files line by line
2278 @cindex line-by-line comparison
2279 @cindex comparing sorted files
2281 @code{comm} writes to standard output lines that are common, and lines
2282 that are unique, to two input files; a file name of @samp{-} means
2283 standard input. Synopsis:
2286 comm [@var{option}]@dots{} @var{file1} @var{file2}
2289 The input files must be sorted before @code{comm} can be used.
2291 @cindex differing lines
2292 @cindex common lines
2293 With no options, @code{comm} produces three-column output. Column one
2294 contains lines unique to @var{file1}, column two contains lines unique
2295 to @var{file2}, and column three contains lines common to both files.
2296 Columns are separated by @key{TAB}.
2297 @c FIXME: when there's an option to supply an alternative separator
2298 @c string, append `by default' to the above sentence.
2303 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2304 the corresponding columns. Also see @ref{Common options}.
2306 Unlike some other comparison utilities, @code{comm} has an exit
2307 status that does not depend on the result of the comparison.
2308 Upon normal completion @code{comm} produces an exit code of zero.
2309 If there is an error it exits with nonzero status.
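A small sketch of the column layout and of the column-suppressing options follows. The two sorted sample files are created in a temporary directory purely for the illustration; their names and contents are assumptions.

```shell
# Two small sorted files in a scratch directory.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/file1"
printf 'banana\ncherry\ndate\n' > "$workdir/file2"

# All three columns, separated by TABs.
all_columns=$(LC_ALL=C comm "$workdir/file1" "$workdir/file2")

# Suppress columns to keep only one kind of line.
common=$(LC_ALL=C comm -12 "$workdir/file1" "$workdir/file2")
only_first=$(LC_ALL=C comm -23 "$workdir/file1" "$workdir/file2")
only_second=$(LC_ALL=C comm -13 "$workdir/file1" "$workdir/file2")

# The exit status is zero even though the files differ.
LC_ALL=C comm "$workdir/file1" "$workdir/file2" > /dev/null
comm_status=$?

rm -r "$workdir"
```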
2312 @node ptx invocation
2313 @section @code{ptx}: Produce permuted indexes
2317 @code{ptx} reads a text file and essentially produces a permuted index, with
2318 each keyword in its context. The calling sketch is either one of:
2321 ptx [@var{option} @dots{}] [@var{file} @dots{}]
2322 ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
2325 The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
2326 all GNU extensions and reverts to traditional mode, thus introducing some
2327 limitations and changing several of the program's default option values.
2328 When @samp{-G} is not specified, GNU extensions are always enabled. GNU
2329 extensions to @code{ptx} are documented wherever appropriate in this
2330 document. @xref{Compatibility in ptx}, for an explicit list of them.
2332 Individual options are explained in the following sections.
2334 When GNU extensions are enabled, there may be zero, one or several
2335 @var{file} after the options. If there is no @var{file}, the program
2336 reads the standard input. If one or more @var{file}s are given, they
2337 name the input files, which are all read in turn, as if all the
2338 input files were concatenated. However, there is a full contextual
2339 break between each file and, when automatic referencing is requested,
2340 file names and line numbers refer to individual text input files. In
2341 all cases, the program writes the permuted index to standard output.
2344 When GNU extensions are @emph{not} enabled, that is, when the program
2345 operates in traditional mode, there may be zero, one or two parameters
2346 besides the options. If there are no parameters, the program reads the
2347 standard input and produces the permuted index onto the standard output.
2348 If there is only one parameter, it names the text @var{input} to be read
2349 instead of the standard input. If two parameters are given, they give
2350 respectively the name of the @var{input} file to read and the name of
2351 the @var{output} file to produce. @emph{Be very careful} to note that,
2352 in this case, the contents of the file named by the second parameter are
2353 destroyed. This behaviour is dictated only by System V @code{ptx}
2354 compatibility, because GNU Standards discourage output parameters not
2355 introduced by an option.
2357 Note that for @emph{any} file named as the value of an option or as an
2358 input text file, a single dash @kbd{-} may be used, in which case
2359 standard input is assumed. However, it would not make sense to use this
2360 convention more than once per program invocation.
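The two calling forms might be exercised as in the following sketch. The sample sentence and the scratch file names are assumptions made for the example; the exact layout of the index is not shown because it depends on the output options.

```shell
workdir=$(mktemp -d)
printf 'the quick brown fox\n' > "$workdir/input"

# GNU form: read standard input, write the index to standard output.
index=$(ptx < "$workdir/input")

# Traditional form: the second parameter names the output file,
# whose previous contents (if any) would be destroyed.
ptx -G "$workdir/input" "$workdir/output"
traditional=$(cat "$workdir/output")

rm -r "$workdir"

printf '%s\n' "$index"
```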
2363 * General options in ptx:: Options which affect general program behaviour.
2364 * Charset selection in ptx:: Underlying character set considerations.
2365 * Input processing in ptx:: Input fields, contexts, and keyword selection.
2366 * Output formatting in ptx:: Types of output format, and sizing the fields.
2367 * Compatibility in ptx::
2371 @node General options in ptx
2372 @subsection General options
2378 Print a short note about the Copyright and copying conditions, then
2379 exit without further processing.
2382 @itemx --traditional
2383 As already explained, this option disables all GNU extensions to
2384 @code{ptx} and switches to traditional mode.
2387 Print a short help message on standard output, then exit without further processing.
2391 Print the program version on standard output, then exit without further processing.
2397 @node Charset selection in ptx
2398 @subsection Charset selection
2400 As it is set up now, the program assumes that the input file is coded
2401 using 8-bit ISO 8859-1 code, also known as Latin-1 character set,
2402 @emph{unless} it is compiled for MS-DOS, in which case it uses the
2403 character set of the IBM-PC. (GNU @code{ptx} is not known to work on
2404 smaller MS-DOS machines anymore.) Compared to 7-bit ASCII, the set of
2405 characters which are letters is then different; this alters the
2406 behaviour of regular expression matching. Thus, the default regular
2407 expression for a keyword allows foreign or diacriticized letters.
2408 Keyword sorting, however, is still crude; it obeys the underlying
2409 character set ordering quite blindly.
2414 @itemx --ignore-case
2415 Fold lower case letters to upper case for sorting.
2420 @node Input processing in ptx
2421 @subsection Word selection and input processing
2426 @item --break-file=@var{file}
2428 This option is an alternative to option @code{-W} for describing
2429 which characters make up words. This option introduces the name of a
2430 file which contains a list of characters which can@emph{not} be part of
2431 one word; this file is called the @dfn{Break file}. Any character which
2432 is not part of the Break file is a word constituent. If both options
2433 @code{-b} and @code{-W} are specified, then @code{-W} has precedence and
2434 @code{-b} is ignored.
2436 When GNU extensions are enabled, the only way to avoid newline as a
2437 break character is to write all the break characters in the file with no
2438 newline at all, not even at the end of the file. When GNU extensions
2439 are disabled, spaces, tabs and newlines are always considered as break
2440 characters even if not included in the Break file.
2443 @itemx --ignore-file=@var{file}
2445 The file associated with this option contains a list of words which will
2446 never be taken as keywords in concordance output. It is called the
2447 @dfn{Ignore file}. The file contains exactly one word in each line; the
2448 end of line separation of words is not subject to the value of the @code{-S} option.
2451 There is a default Ignore file used by @code{ptx} when this option is
2452 not specified, usually found in @file{/usr/local/lib/eign} if this has
2453 not been changed at installation time. If you want to deactivate the
2454 default Ignore file, specify @code{/dev/null} instead.
2457 @itemx --only-file=@var{file}
2459 The file associated with this option contains a list of words which will
2460 be retained in concordance output; any word not mentioned in this file
2461 is ignored. The file is called the @dfn{Only file}. The file contains
2462 exactly one word in each line; the end of line separation of words is
2463 not subject to the value of the @code{-S} option.
2465 There is no default for the Only file. In the case there are both an
2466 Only file and an Ignore file, a word is eligible to be a keyword
2467 only if it is given in the Only file and not given in the Ignore file.
2472 On each input line, the leading sequence of non-white-space characters will be
2473 taken to be a reference that has the purpose of identifying this input
2474 line in the produced permuted index. @xref{Output formatting in ptx}, for
2475 more information about reference production. Using this option changes
2476 the default value for option @code{-S}.
2478 Using this option, the program does not try very hard to remove
2479 references from contexts in output, but it succeeds in doing so
2480 @emph{when} the context ends exactly at the newline. If option
2481 @code{-r} is used with @code{-S} default value, or when GNU extensions
2482 are disabled, this condition is always met and references are completely
2483 excluded from the output contexts.
2485 @item -S @var{regexp}
2486 @itemx --sentence-regexp=@var{regexp}
2488 This option selects which regular expression will describe the end of a
2489 line or the end of a sentence. In fact, there is no other distinction
2490 between end of lines or end of sentences than the effect of this regular
2491 expression, and input line boundaries have no special significance
2492 outside this option. By default, when GNU extensions are enabled and if
2493 @code{-r} option is not used, end of sentences are used. In this
2494 case, the precise @var{regexp} is imported from GNU Emacs:
2497 [.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
2500 Whenever GNU extensions are disabled or if @code{-r} option is used, end
2501 of lines are used; in this case, the default @var{regexp} is just:
2507 Using an empty @var{regexp} is equivalent to completely disabling end of line or end
2508 of sentence recognition. In this case, the whole file is considered to
2509 be a single big line or sentence. The user might want to disallow all
2510 truncation flag generation as well, through option @code{-F ""}.
2511 @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs Manual}.
2514 When the keywords happen to be near the beginning of the input line or
2515 sentence, this often creates an unused area at the beginning of the
2516 output context line; when the keywords happen to be near the end of the
2517 input line or sentence, this often creates an unused area at the end of
2518 the output context line. The program tries to fill those unused areas
2519 by wrapping around context in them; the tail of the input line or
2520 sentence is used to fill the unused area on the left of the output line;
2521 the head of the input line or sentence is used to fill the unused area
2522 on the right of the output line.
2524 As a matter of convenience to the user, many usual backslashed escape
2525 sequences, as found in the C language, are recognized and converted to
2526 the corresponding characters by @code{ptx} itself.
2528 @item -W @var{regexp}
2529 @itemx --word-regexp=@var{regexp}
2531 This option selects which regular expression will describe each keyword.
2532 By default, if GNU extensions are enabled, a word is a sequence of
2533 letters; the @var{regexp} used is @code{\w+}. When GNU extensions are
2534 disabled, a word is by default anything which ends with a space, a tab
2535 or a newline; the @var{regexp} used is @code{[^ \t\n]+}.
2537 An empty @var{regexp} is equivalent to not using this option, letting the
2538 default apply. @xref{Regexps, , Syntax of Regular Expressions, emacs,
2539 The GNU Emacs Manual}.
2541 As a matter of convenience to the user, many usual backslashed escape
2542 sequences, as found in the C language, are recognized and converted to
2543 the corresponding characters by @code{ptx} itself.
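The effect of an Only file and of an Ignore file can be sketched as follows. The input sentence and the two word lists are invented for the example, and the line counts in the comments assume that no default Ignore file is installed.

```shell
workdir=$(mktemp -d)
printf 'the quick brown fox\n' > "$workdir/input"
printf 'fox\n' > "$workdir/only"     # one word per line
printf 'the\n' > "$workdir/ignore"   # one word per line

# Only words listed in the Only file become keywords: one index line.
only_fox=$(ptx -o "$workdir/only" "$workdir/input")

# Words listed in the Ignore file never become keywords: three index lines.
without_the=$(ptx -i "$workdir/ignore" "$workdir/input")

rm -r "$workdir"

printf '%s\n' "$only_fox"
printf '%s\n' "$without_the"
```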
2548 @node Output formatting in ptx
2549 @subsection Output formatting
2551 Output format is mainly controlled by @code{-O} and @code{-T} options,
2552 described in the table below. When neither @code{-O} nor @code{-T} is
2553 selected, and if GNU extensions are enabled, the program chooses an
2554 output format suited for a dumb terminal. Each keyword occurrence is
2555 output to the center of one line, surrounded by its left and right
2556 contexts. Each field is properly justified, so the concordance output
2557 could readily be observed. As a special feature, if automatic
2558 references are selected by option @code{-A} and are output before the
2559 left context, that is, if option @code{-R} is @emph{not} selected, then
2560 a colon is added after the reference; this nicely interfaces with GNU
2561 Emacs @code{next-error} processing. In this default output format, each
2562 white space character, like newline and tab, is merely changed to
2563 exactly one space, with no special attempt to compress consecutive
2564 spaces. This might change in the future. Except for those white space
2565 characters, every other character of the underlying set of 256
2566 characters is transmitted verbatim.
2568 Output format is further controlled by the following options.
2572 @item -g @var{number}
2573 @itemx --gap-size=@var{number}
2575 Select the size of the minimum white gap between the fields on the output
2578 @item -w @var{number}
2579 @itemx --width=@var{number}
2581 Select the output maximum width of each final line. If references are
2582 used, they are included or excluded from the output maximum width
2583 depending on the value of option @code{-R}. If this option is not
2584 selected, that is, when references are output before the left context,
2585 the output maximum width takes into account the maximum length of all
2586 references. If this option is selected, that is, when references are
2587 output after the right context, the output maximum width does not take
2588 into account the space taken by references, nor the gap that precedes them.
2592 @itemx --auto-reference
2594 Select automatic references. Each input line will have an automatic
2595 reference made up of the file name and the line ordinal, with a single
2596 colon between them. However, the file name will be empty when standard
2597 input is being read. If both @code{-A} and @code{-r} are selected, then
2598 the input reference is still read and skipped, but the automatic
2599 reference is used at output time, overriding the input reference.
2602 @itemx --right-side-refs
2604 In default output format, when option @code{-R} is not used, any
2605 reference produced by the effect of options @code{-r} or @code{-A} is
2606 given at the beginning of each output line, before the left context. In
2607 default output format, when option @code{-R} is specified, references
2608 are rather given to the far right of output lines, after the right
2609 context. For any other output format, option @code{-R} is almost
2610 ignored, except for the fact that the width of references is @emph{not}
2611 taken into account in total output width given by @code{-w} whenever
2612 @code{-R} is selected.
2614 This option is automatically selected whenever GNU extensions are disabled.
2617 @item -F @var{string}
2618 @itemx --flag-truncation=@var{string}
2620 This option will request that any truncation in the output be reported
2621 using the string @var{string}. Most output fields theoretically extend
2622 towards the beginning or the end of the current line, or current
2623 sentence, as selected with option @code{-S}. But there is a maximum
2624 allowed output line width, changeable through option @code{-w}, which is
2625 further divided into space for various output fields. When a field has
2626 to be truncated because it cannot extend to the beginning or the end of
2627 the current line and still fit in its space, a truncation occurs. By default,
2628 the string used is a single slash, as in @code{-F /}.
2630 @var{string} may have more than one character, as in @code{-F ...}.
2631 Also, in the particular case @var{string} is empty (@code{-F ""}),
2632 truncation flagging is disabled, and no truncation marks are appended in this case.
2635 As a matter of convenience to the user, many usual backslashed escape
2636 sequences, as found in the C language, are recognized and converted to
2637 the corresponding characters by @code{ptx} itself.
2639 @item -M @var{string}
2640 @itemx --macro-name=@var{string}
2642 Select another @var{string} to be used instead of @samp{xx}, while
2643 generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
2646 @itemx --format=roff
2648 Choose an output format suitable for @code{nroff} or @code{troff}
2649 processing. Each output line will look like:
2652 .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
2655 so it will be possible to write an @samp{.xx} roff macro to take care of
2656 the output typesetting. This is the default output format when GNU
2657 extensions are disabled. Option @samp{-M} might be used to change
2658 @samp{xx} to another macro name.
2660 In this output format, each non-graphical character, like newline and
2661 tab, is merely changed to exactly one space, with no special attempt to
2662 compress consecutive spaces. Each quote character: @kbd{"} is doubled
2663 so it will be correctly processed by @code{nroff} or @code{troff}.
2668 Choose an output format suitable for @TeX{} processing. Each output
2669 line will look like:
2672 \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
2676 so it will be possible to write a @code{\xx} definition to take
2677 care of the output typesetting. Note that when references are not being
2678 produced, that is, neither option @code{-A} nor option @code{-r} is
2679 selected, the last parameter of each @code{\xx} call is omitted.
2680 Option @samp{-M} might be used to change @samp{xx} to another macro name.
2683 In this output format, some special characters, like @kbd{$}, @kbd{%},
2684 @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
2685 backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
2686 backslash, but also enclosed in a pair of dollar signs to force
2687 mathematical mode. The backslash itself produces the sequence
2688 @code{\backslash@{@}}. Circumflex and tilde diacritics produce the
2689 sequence @code{^\@{ @}} and @code{~\@{ @}} respectively. Other
2690 diacriticized characters of the underlying character set produce an
2691 appropriate @TeX{} sequence as far as possible. The other non-graphical
2692 characters, like newline and tab, and all other characters which are
2693 not part of ASCII, are merely changed to exactly one space, with no
2694 special attempt to compress consecutive spaces. Let me know how to
2695 improve this special character processing for @TeX{}.
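The roff and @TeX{} output formats, and the effect of @samp{-M}, can be inspected with a sketch like the one below. The input text and the macro name @samp{PX} are arbitrary choices for the illustration; only the leading macro call of each line is relied upon here, not the exact field contents.

```shell
sample='quick fox'

# roff format: every line is a call to the .xx macro.
roff=$(printf '%s\n' "$sample" | ptx -O)

# TeX format: every line is a call to \xx with braced arguments.
tex=$(printf '%s\n' "$sample" | ptx -T)

# -M replaces the macro name xx with another string (PX is arbitrary).
renamed=$(printf '%s\n' "$sample" | ptx -O -M PX)

printf '%s\n' "$roff" "$tex" "$renamed"
```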
2700 @node Compatibility in ptx
2701 @subsection The GNU extensions to @code{ptx}
2703 This version of @code{ptx} contains a few features which do not exist in
2704 System V @code{ptx}. These extra features are suppressed by using the
2705 @samp{-G} command line option, unless overridden by other command line
2706 options. Some GNU extensions cannot be recovered by overriding, so the
2707 simple rule is to avoid @samp{-G} if you care about GNU extensions.
2708 Here are the differences between this program and System V @code{ptx}.
2713 This program can read many input files at once; it always writes the
2714 resulting concordance on standard output. On the other hand, System V
2715 @code{ptx} reads only one file and produces the result on standard output
2716 or, if a second @var{file} parameter is given on the command line, to that @var{file}.
2719 Having output parameters not introduced by options is a quite dangerous
2720 practice which GNU avoids as far as possible. So, for using @code{ptx}
2721 portably between GNU and System V, you should pay attention to always
2722 use it with a single input file, and always expect the result on
2723 standard output. You might also want to automatically configure in a
2724 @samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
2725 the configurator finds that the installed @code{ptx} accepts @samp{-G}.
2728 The only options available in System V @code{ptx} are options @samp{-b},
2729 @samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
2730 @samp{-w}. All other options are GNU extensions and are not repeated in
2731 this enumeration. Moreover, some options have a slightly different
2732 meaning when GNU extensions are enabled, as explained below.
2735 By default, concordance output is not formatted for @code{troff} or
2736 @code{nroff}. It is rather formatted for a dumb terminal. @code{troff}
2737 or @code{nroff} output may still be selected through option @code{-O}.
2740 Unless @code{-R} option is used, the maximum reference width is
2741 subtracted from the total output line width. With GNU extensions
2742 disabled, width of references is not taken into account in the output
2743 line width computations.
2746 All 256 characters, even @kbd{NUL}s, are always read and processed from
2747 input file with no adverse effect, even if GNU extensions are disabled.
2748 However, System V @code{ptx} does not accept 8-bit characters, a few
2749 control characters are rejected, and the tilde @kbd{~} is condemned.
2752 Input line length is only limited by available memory, even if GNU
2753 extensions are disabled. However, System V @code{ptx} processes only
2754 the first 200 characters in each line.
2757 The break (non-word) characters default to every character except the
2758 letters of the underlying character set, diacriticized or not. When GNU
2759 extensions are disabled, the break characters default to space, tab and newline.
2763 The program makes better use of output line width. If GNU extensions
2764 are disabled, the program rather tries to imitate System V @code{ptx},
2765 but still, there are some slight disposition glitches this program does
2766 not completely reproduce.
2769 The user can specify both an Ignore file and an Only file. This is not
2770 allowed with System V @code{ptx}.
2775 @node Operating on fields within a line
2776 @chapter Operating on fields within a line
2779 * cut invocation:: Print selected parts of lines.
2780 * paste invocation:: Merge lines of files.
2781 * join invocation:: Join lines on a common field.
2785 @node cut invocation
2786 @section @code{cut}: Print selected parts of lines
2789 @code{cut} writes to standard output selected parts of each line of each
2790 input file, or standard input if no files are given or for a file name of @samp{-}. Synopsis:
2794 cut [@var{option}]@dots{} [@var{file}]@dots{}
2797 In the table which follows, the @var{byte-list}, @var{character-list},
2798 and @var{field-list} are one or more numbers or ranges (two numbers
2799 separated by a dash) separated by commas. Bytes, characters, and
2800 fields are numbered starting at 1. Incomplete ranges may be
2801 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
2802 @samp{@var{n}} through end of line or last field.
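The list and range notation can be tried on a single sample line, as in this sketch. The colon-separated line and the variable names are assumptions made for the example; @samp{-d}, @samp{-f} and @samp{-c} are described in the table that follows.

```shell
line='a:b:c:d'

# A closed range of fields, using : as the delimiter.
middle_fields=$(printf '%s\n' "$line" | cut -d : -f 2-3)

# Incomplete ranges: -2 means 1-2, and 3- means 3 through the end.
leading_fields=$(printf '%s\n' "$line" | cut -d : -f -2)
from_third_char=$(printf '%s\n' "$line" | cut -c 3-)

# A comma-separated list of individual fields.
first_and_last=$(printf '%s\n' "$line" | cut -d : -f 1,4)
```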
2804 The program accepts the following options. Also see @ref{Common options}.
2809 @item -b @var{byte-list}
2810 @itemx --bytes=@var{byte-list}
2813 Print only the bytes in positions listed in @var{byte-list}. Tabs and
2814 backspaces are treated like any other character; they take up 1 byte.
2816 @item -c @var{character-list}
2817 @itemx --characters=@var{character-list}
2819 @opindex --characters
2820 Print only characters in positions listed in @var{character-list}.
2821 The same as @samp{-b} for now, but internationalization will change
2822 that. Tabs and backspaces are treated like any other character; they
2823 take up 1 character.
2825 @item -f @var{field-list}
2826 @itemx --fields=@var{field-list}
2829 Print only the fields listed in @var{field-list}. Fields are
2830 separated by a @key{TAB} by default.
2832 @item -d @var{delim}
2833 @itemx --delimiter=@var{delim}
2835 @opindex --delimiter
2836 For @samp{-f}, fields are separated by the first character in @var{delim}
2837 (default is @key{TAB}).
2841 Do not split multi-byte characters (no-op for now).
2844 @itemx --only-delimited
2846 @opindex --only-delimited
2847 For @samp{-f}, do not print lines that do not contain the field separator character.
2853 @node paste invocation
2854 @section @code{paste}: Merge lines of files
2857 @cindex merging files
2859 @code{paste} writes to standard output lines consisting of sequentially
2860 corresponding lines of each given file, separated by @key{TAB}.
2861 Standard input is used for a file name of @samp{-} or if no input files are given.
2867 paste [@var{option}]@dots{} [@var{file}]@dots{}
2870 The program accepts the following options. Also see @ref{Common options}.
2878 Paste the lines of one file at a time rather than one line from each file.
2881 @item -d @var{delim-list}
2882 @itemx --delimiters=@var{delim-list}
2884 @opindex --delimiters
2885 Consecutively use the characters in @var{delim-list} instead of
2886 @key{TAB} to separate merged lines. When @var{delim-list} is
2887 exhausted, start again at its beginning.
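A brief sketch of merging with the default separator, with a delimiter list, and one file at a time follows. The three small files are created in a scratch directory only for the illustration, and @samp{-s} is assumed to be the option that pastes the lines of one file at a time.

```shell
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/letters"
printf '1\n2\n' > "$workdir/digits"
printf 'x\ny\n' > "$workdir/marks"

# Default: corresponding lines joined by a TAB.
tabbed=$(paste "$workdir/letters" "$workdir/digits")

# Delimiter list ":," is used in turn between the merged lines.
delimited=$(paste -d ':,' "$workdir/letters" "$workdir/digits" "$workdir/marks")

# -s pastes all lines of one file into a single output line.
serial=$(paste -s -d ',' "$workdir/letters")

rm -r "$workdir"
```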
2892 @node join invocation
2893 @section @code{join}: Join lines on a common field
2896 @cindex common field, joining on
2898 @code{join} writes to standard output a line for each pair of input
2899 lines that have identical join fields. Synopsis:
2902 join [@var{option}]@dots{} @var{file1} @var{file2}
2905 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
2906 meaning standard input. @var{file1} and @var{file2} should be already
2907 sorted in increasing order (not numerically) on the join fields; unless
2908 the @samp{-t} option is given, they should be sorted ignoring blanks at
2909 the start of the join field, as in @code{sort -b}. If the
2910 @samp{--ignore-case} option is given, lines should be sorted without
2911 regard to the case of characters in the join field, as in @code{sort -f}.
2913 The defaults are: the join field is the first field in each line;
2914 fields in the input are separated by one or more blanks, with leading
2915 blanks on the line ignored; fields in the output are separated by a
2916 space; each output line consists of the join field, the remaining
2917 fields from @var{file1}, then the remaining fields from @var{file2}.
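With the defaults just described, a join on the first field might look like the following sketch. The two sorted sample files and their contents are invented for the example.

```shell
workdir=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$workdir/file1"
printf '1 red\n3 dark\n4 blue\n' > "$workdir/file2"

# Join field is field 1 of each file; output is the join field,
# then the rest of file1's line, then the rest of file2's line.
joined=$(LC_ALL=C join "$workdir/file1" "$workdir/file2")

rm -r "$workdir"

printf '%s\n' "$joined"
```

Lines whose join field appears in only one file (2 and 4 here) produce no output by default.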
2919 The program accepts the following options. Also see @ref{Common options}.
2923 @item -a @var{file-number}
2925 Print a line for each unpairable line in file @var{file-number} (either
2926 @samp{1} or @samp{2}), in addition to the normal output.
2928 @item -e @var{string}
2930 Replace those output fields that are missing in the input with @var{string}.
2934 @itemx --ignore-case
2936 @opindex --ignore-case
2937 Ignore differences in case when comparing keys.
2938 With this option, the lines of the input files must be ordered in the same way.
2939 Use @samp{sort -f} to produce this ordering.
2941 @item -1 @var{field}
2942 @itemx -j1 @var{field}
2945 Join on field @var{field} (a positive integer) of file 1.
2947 @item -2 @var{field}
2948 @itemx -j2 @var{field}
2951 Join on field @var{field} (a positive integer) of file 2.
2953 @item -j @var{field}
2954 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
2956 @item -o @var{field-list}@dots{}
2957 Construct each output line according to the format in @var{field-list}.
2958 Each element in @var{field-list} is either the single character @samp{0} or
2959 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
2960 @samp{2} and @var{n} is a positive field number.
2962 A field specification of @samp{0} denotes the join field.
2963 In most cases, the functionality of the @samp{0} field spec
2964 may be reproduced using the explicit @var{m.n} that corresponds
2965 to the join field. However, when printing unpairable lines
2966 (using either of the @samp{-a} or @samp{-v} options), there is no way
2967 to specify the join field using @var{m.n} in @var{field-list}
2968 if there are unpairable lines in both files.
2969 To give @code{join} that functionality, @sc{POSIX} invented the @samp{0}
2970 field specification notation.
2972 The elements in @var{field-list}
2973 are separated by commas or blanks. Multiple @var{field-list}
2974 arguments can be given after a single @samp{-o} option; the values
2975 of all lists given with @samp{-o} are concatenated together.
2976 All output lines, including those printed because of any @samp{-a} or @samp{-v}
2977 option, are subject to the specified @var{field-list}.
2980 Use character @var{char} as the input and output field separator.
2982 @item -v @var{file-number}
2983 Print a line for each unpairable line in file @var{file-number}
2984 (either @samp{1} or @samp{2}), instead of the normal output.
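As an illustrative sketch of how @samp{-a}, @samp{-v}, @samp{-e}, and @samp{-o} interact, the following uses two small made-up files. The file names, their contents, and the @samp{NA} filler string are assumptions chosen for the example, not anything this manual prescribes.

```shell
# Illustrative data only: two small files, each sorted on its first field.
tmp=$(mktemp -d)
printf 'a 1\nb 2\nc 3\n' > "$tmp/left"
printf 'b x\nc y\nd z\n' > "$tmp/right"

# Default behaviour: only lines whose join fields pair up.
paired=$(join "$tmp/left" "$tmp/right")

# Also print unpairable lines from both files.  The 0 field spec names the
# join field whichever file a line came from; -e fills the missing fields.
full=$(join -a 1 -a 2 -e NA -o 0,1.2,2.2 "$tmp/left" "$tmp/right")

# Print only the unpairable lines of file 1.
only_left=$(join -v 1 "$tmp/left" "$tmp/right")

printf '%s\n' "$paired" "==" "$full" "==" "$only_left"
rm -rf "$tmp"
```

Because lines from both files can be unpairable here, the @samp{0} field spec is what keeps the join field in the first output column for every line.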
2988 In addition, when GNU @code{join} is invoked with exactly one argument,
2989 options @samp{--help} and @samp{--version} are recognized. @xref{Common options}.
2993 @node Operating on characters
2994 @chapter Operating on characters
2996 @cindex operating on characters
2998 These commands operate on individual characters.
3001 * tr invocation:: Translate, squeeze, and/or delete characters.
3002 * expand invocation:: Convert tabs to spaces.
3003 * unexpand invocation:: Convert spaces to tabs.
3008 @section @code{tr}: Translate, squeeze, and/or delete characters
3015 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
3018 @code{tr} copies standard input to standard output, performing
3019 one of the following operations:
3023 translate, and optionally squeeze repeated characters in the result,
3025 squeeze repeated characters,
3029 delete characters, then squeeze repeated characters from the result.
3032 The @var{set1} and (if given) @var{set2} arguments define ordered
3033 sets of characters, referred to below as @var{set1} and @var{set2}. These
3034 sets are the characters of the input that @code{tr} operates on.
3035 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
3036 complement (all of the characters that are not in @var{set1}).
3039 * Character sets:: Specifying sets of characters.
3040 * Translating:: Changing one character to another.
3041 * Squeezing:: Squeezing repeats and deleting.
3042 * Warnings in tr:: Warning messages.
3046 @node Character sets
3047 @subsection Specifying sets of characters
3049 @cindex specifying sets of characters
3051 The format of the @var{set1} and @var{set2} arguments resembles
3052 the format of regular expressions; however, they are not regular
3053 expressions, only lists of characters. Most characters simply
3054 represent themselves in these strings, but the strings can contain
3055 the shorthands listed below, for convenience. Some of them can be
3056 used only in @var{set1} or @var{set2}, as noted below.
3060 @item Backslash escapes
3061 @cindex backslash escapes
3063 A backslash followed by a character not listed below causes an error
3082 The character with the value given by @var{ooo}, which is 1 to 3 octal digits.
3091 The notation @samp{@var{m}-@var{n}} expands to all of the characters
3092 from @var{m} through @var{n}, in ascending order. @var{m} should
3093 collate before @var{n}; if it doesn't, an error results. As an example,
3094 @samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
3095 does not support the System V syntax that uses square brackets to
3096 enclose ranges, translations specified in that format will still work as
3097 long as the brackets in @var{set1} correspond to identical brackets
3100 @item Repeated characters
3101 @cindex repeated characters
3103 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
3104 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
3105 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
3106 to as many copies of @var{c} as are needed to make @var{set2} as long as
3107 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
3108 octal, otherwise in decimal.
3110 @item Character classes
3111 @cindex character classes
3113 The notation @samp{[:@var{class}:]} expands to all of the characters in
3114 the (predefined) class @var{class}. The characters expand in no
3115 particular order, except for the @code{upper} and @code{lower} classes,
3116 which expand in ascending order. When the @samp{--delete} (@samp{-d})
3117 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
3118 character class can be used in @var{set2}. Otherwise, only the
3119 character classes @code{lower} and @code{upper} are accepted in
3120 @var{set2}, and then only if the corresponding character class
3121 (@code{upper} and @code{lower}, respectively) is specified in the same
3122 relative position in @var{set1}. Doing this specifies case conversion.
3123 The class names are given below; an error results when an invalid class
3135 Horizontal whitespace.
3144 Printable characters, not including space.
3150 Printable characters, including space.
3153 Punctuation characters.
3156 Horizontal or vertical whitespace.
3165 @item Equivalence classes
3166 @cindex equivalence classes
3168 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
3169 equivalent to @var{c}, in no particular order. Equivalence classes are
3170 a relatively recent invention intended to support non-English alphabets.
3171 But there seems to be no standard way to define them or determine their
3172 contents. Therefore, they are not fully implemented in GNU @code{tr};
3173 each character's equivalence class consists only of that character,
3174 which is of no particular use.
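The shorthands described above can be tried out on short strings. This is only a sketch: the sample inputs and variable names are invented for illustration, and the results assume the behaviour of GNU @code{tr}.

```shell
# Range: 0-9 stands for the ten digits; [#*] pads set2 with as many
# copies of '#' as are needed to match the length of set1.
masked=$(printf 'tel 555 0199\n' | tr 0-9 '[#*]')

# Repeat count: [x*3] is the same as xxx.
repeated=$(printf 'abcabc\n' | tr abc '[x*3]')

# Character classes in matching positions specify case conversion.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')

# Octal escape: \012 is the newline character.
lines=$(printf 'a b c\n' | tr ' ' '\012')

printf '%s\n' "$masked" "$repeated" "$upper" "$lines"
```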
3180 @subsection Translating
3182 @cindex translating characters
3184 @code{tr} performs translation when @var{set1} and @var{set2} are
3185 both given and the @samp{--delete} (@samp{-d}) option is not given.
3186 @code{tr} translates each character of its input that is in @var{set1}
3187 to the corresponding character in @var{set2}. Characters not in
3188 @var{set1} are passed through unchanged. When a character appears more
3189 than once in @var{set1} and the corresponding characters in @var{set2}
3190 are not all the same, only the final one is used. For example, these
3191 two commands are equivalent:
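A quick way to check this equivalence on a sample string is sketched here. The input word is arbitrary, and the result assumes GNU @code{tr}.

```shell
# 'a' appears three times in set1; only the final mapping (a -> z) is used.
first=$(printf 'banana\n' | tr aaa xyz)
second=$(printf 'banana\n' | tr a z)

printf '%s\n%s\n' "$first" "$second"
```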
3198 A common use of @code{tr} is to convert lowercase characters to
3199 uppercase. This can be done in many ways. Here are three of them:
3202 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
3204 tr '[:lower:]' '[:upper:]'
3207 When @code{tr} is performing translation, @var{set1} and @var{set2}
3208 typically have the same length. If @var{set1} is shorter than
3209 @var{set2}, the extra characters at the end of @var{set2} are ignored.
3211 On the other hand, making @var{set1} longer than @var{set2} is not
3212 portable; @sc{POSIX.2} says that the result is undefined. In this situation,
3213 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
3214 the last character of @var{set2} as many times as necessary. System V
3215 @code{tr} truncates @var{set1} to the length of @var{set2}.
3217 By default, GNU @code{tr} handles this case like BSD @code{tr}. When
3218 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
3219 handles this case like the System V @code{tr} instead. This option is
3220 ignored for operations other than translation.
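The difference between the two behaviors can be seen with a short @var{set2}. This is a minimal sketch with an arbitrary input string, assuming GNU @code{tr}.

```shell
# Default (BSD-like): set2 "xy" is padded with its last character,
# so the translation is effectively abcd -> xyyy.
padded=$(printf 'abcd\n' | tr abcd xy)

# With -t (System V-like): set1 is truncated to "ab"; c and d pass through.
truncated=$(printf 'abcd\n' | tr -t abcd xy)

printf '%s\n%s\n' "$padded" "$truncated"
```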
3222 Acting like System V @code{tr} in this case breaks the relatively common
3226 tr -cs A-Za-z0-9 '\012'
3230 because it converts only zero bytes (the first element in the
3231 complement of @var{set1}), rather than all non-alphanumerics, to
3236 @subsection Squeezing repeats and deleting
3238 @cindex squeezing repeat characters
3239 @cindex deleting characters
3241 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
3242 removes any input characters that are in @var{set1}.
3244 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
3245 @code{tr} replaces each input sequence of a repeated character that
3246 is in @var{set1} with a single occurrence of that character.
3248 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
3249 first performs any deletions using @var{set1}, then squeezes repeats
3250 from any remaining characters using @var{set2}.
3252 The @samp{--squeeze-repeats} option may also be used when translating,
3253 in which case @code{tr} first performs translation, then squeezes
3254 repeats from any remaining characters using @var{set2}.
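The four combinations just described can be compared side by side. The inputs and variable names below are made up for illustration, and the results assume GNU @code{tr}.

```shell
# -d alone: remove every character that is in set1.
deleted=$(printf 'aabbcc\n' | tr -d b)

# -s alone: squeeze runs of characters that are in set1.
squeezed=$(printf 'aabbcc\n' | tr -s b)

# -d and -s together: delete using set1 (b), then squeeze using set2 (c).
both=$(printf 'aabbccdd\n' | tr -ds b c)

# -s while translating: blanks become newlines, then runs of newlines
# (the characters of set2) are squeezed to one.
words=$(printf 'one   two  three\n' | tr -s ' ' '\n')

printf '%s\n' "$deleted" "$squeezed" "$both" "$words"
```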
3256 Here are some examples to illustrate various combinations of options:
3261 Remove all zero bytes:
3268 Put all words on lines by themselves. This converts all
3269 non-alphanumeric characters to newlines, then squeezes each string
3270 of repeated newlines into a single newline:
3273 tr -cs 'a-zA-Z0-9' '[\n*]'
3277 Convert each sequence of repeated newlines to a single newline:
3284 Find doubled occurrences of words in a document.
3285 For example, people often write ``the the'' with the duplicated words
3286 separated by a newline. The Bourne shell script below works first
3287 by converting each sequence of punctuation and blank characters to a
3288 single newline. That puts each ``word'' on a line by itself.
3289 Next it maps all uppercase characters to lower case, and finally it
3290 runs @code{uniq} with the @samp{-d} option to print out only the words
3291 that were adjacent duplicates.
3296 | tr -s '[:punct:][:blank:]' '\n' \
3297 | tr '[:upper:]' '[:lower:]' \
3304 @node Warnings in tr
3305 @subsection Warning messages
3307 @vindex POSIXLY_CORRECT
3308 Setting the environment variable @code{POSIXLY_CORRECT} turns off the
3309 following warning and error messages, for strict compliance with
3310 @sc{POSIX.2}. Otherwise, the following diagnostics are issued:
3315 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
3316 is not, and @var{set2} is given, GNU @code{tr} by default prints
3317 a usage message and exits, because @var{set2} would not be used.
3318 The @sc{POSIX} specification says that @var{set2} must be ignored in
3319 this case. Silently ignoring arguments is a bad idea.
3322 When an ambiguous octal escape is given. For example, @samp{\400}
3323 is actually @samp{\40} followed by the digit @samp{0}, because the
3324 value 400 octal does not fit into a single byte.
3328 GNU @code{tr} does not provide complete BSD or System V compatibility.
3329 For example, it is impossible to disable interpretation of the @sc{POSIX}
3330 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
3331 @code{tr} does not delete zero bytes automatically, unlike traditional
3332 Unix versions, which provide no way to preserve zero bytes.
3335 @node expand invocation
3336 @section @code{expand}: Convert tabs to spaces
3339 @cindex tabs to spaces, converting
3340 @cindex converting tabs to spaces
3342 @code{expand} writes the contents of each given @var{file}, or standard
3343 input if none are given or for a @var{file} of @samp{-}, to standard
3344 output, with tab characters converted to the appropriate number of
3348 expand [@var{option}]@dots{} [@var{file}]@dots{}
3351 By default, @code{expand} converts all tabs to spaces. It preserves
3352 backspace characters in the output; they decrement the column count for
3353 tab calculations. The default action is equivalent to @samp{-8} (set
3354 tabs every 8 columns).
3356 The program accepts the following options. Also see @ref{Common options}.
3360 @item -@var{tab1}[,@var{tab2}]@dots{}
3361 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3362 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3366 @cindex tabstops, setting
3367 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3368 (default is 8). Otherwise, set the tabs at columns @var{tab1},
3369 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
3370 last tabstop given with single spaces. If the tabstops are specified
3371 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3372 blanks as well as by commas.
3378 @cindex initial tabs, converting
3379 Only convert initial tabs (those that precede all non-space, non-tab
3380 characters) on each line to spaces.
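A small sketch of these options follows. The sample strings are arbitrary, and @code{printf} width specifiers are used only to make the number of spaces explicit.

```shell
# Default tab stops every 8 columns: the tab after "a" becomes 7 spaces.
default=$(printf 'a\tb\n' | expand)

# Tab stops every 4 columns: the same tab becomes 3 spaces.
four=$(printf 'a\tb\n' | expand -t 4)

# -i: only the initial tab is converted; the later one is left alone.
initial=$(printf '\ta\tb\n' | expand -i -t 4)

printf '%s\n' "$default" "$four" "$initial"
```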
3385 @node unexpand invocation
3386 @section @code{unexpand}: Convert spaces to tabs
3390 @code{unexpand} writes the contents of each given @var{file}, or
3391 standard input if none are given or for a @var{file} of @samp{-}, to
3392 standard output, with strings of two or more space or tab characters
3393 converted to as many tabs as possible followed by as many spaces as are
3397 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
3400 By default, @code{unexpand} converts only initial spaces and tabs (those
3401 that precede all non-space, non-tab characters) on each line. It
3402 preserves backspace characters in the output; they decrement the column
3403 count for tab calculations. By default, tabs are set at every 8th
3406 The program accepts the following options. Also see @ref{Common options}.
3410 @item -@var{tab1}[,@var{tab2}]@dots{}
3411 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3412 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3416 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3417 instead of the default 8. Otherwise, set the tabs at columns
3418 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
3419 tabs beyond the tabstops given unchanged. If the tabstops are specified
3420 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3421 blanks as well as by commas. This option implies the @samp{-a} option.
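The default behavior and the effect of @samp{-a} and @samp{-t} can be sketched as follows. The inputs are invented for illustration, the results assume GNU @code{unexpand}, and @code{printf} width specifiers are used to make the space counts explicit.

```shell
# Eight leading spaces reach the first default tab stop: one tab results.
leading=$(printf '%8sx\n' '' | unexpand)

# By default, blanks after the first non-blank character are left alone.
inner=$(printf 'a%7sb\n' '' | unexpand)

# -a converts them too: the 7 spaces after "a" end at column 8.
all=$(printf 'a%7sb\n' '' | unexpand -a)

# -t 4 sets tab stops every 4 columns (and implies -a).
four=$(printf '%4sx\n' '' | unexpand -t 4)

printf '%s\n' "$leading" "$inner" "$all" "$four"
```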
3427 Convert all strings of two or more spaces or tabs, not just initial
3434 @node Opening the software toolbox
3435 @chapter Opening the software toolbox
3437 This chapter originally appeared in @cite{Linux Journal}, volume 1,
3438 number 2, in the @cite{What's GNU?} column. It was written by Arnold
3442 * Toolbox introduction:: Toolbox introduction
3443 * I/O redirection:: I/O redirection
3444 * The who command:: The @code{who} command
3445 * The cut command:: The @code{cut} command
3446 * The sort command:: The @code{sort} command
3447 * The uniq command:: The @code{uniq} command
3448 * Putting the tools together:: Putting the tools together
3452 @node Toolbox introduction
3453 @unnumberedsec Toolbox introduction
3455 This month's column is only peripherally related to the GNU Project, in
3456 that it describes a number of the GNU tools on your Linux system and how they
3457 might be used. What it's really about is the ``Software Tools'' philosophy
3458 of program development and usage.
3460 The software tools philosophy was an important and integral concept
3461 in the initial design and development of Unix (of which Linux and GNU are
3462 essentially clones). Unfortunately, in the modern day press of
3463 Internetworking and flashy GUIs, it seems to have fallen by the
3464 wayside. This is a shame, since it provides a powerful mental model
3465 for solving many kinds of problems.
3467 Many people carry a Swiss Army knife around in their pants pockets (or
3468 purse). A Swiss Army knife is a handy tool to have: it has several knife
3469 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
3470 a number of other things on it. For the everyday, small miscellaneous jobs
3471 where you need a simple, general purpose tool, it's just the thing.
3473 On the other hand, an experienced carpenter doesn't build a house using
3474 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
3475 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
3476 exactly when and where to use each tool; you won't catch him hammering nails
3477 with the handle of his screwdriver.
3479 The Unix developers at Bell Labs were all professional programmers and trained
3480 computer scientists. They had found that while a one-size-fits-all program
3481 might appeal to a user because there's only one program to use, in practice
3489 difficult to maintain and
3493 difficult to extend to meet new situations.
3496 Instead, they felt that programs should be specialized tools. In short, each
3497 program ``should do one thing well.'' No more and no less. Such programs are
3498 simpler to design, write, and get right---they only do one thing.
3500 Furthermore, they found that, with the right machinery for hooking programs
3501 together, the whole was greater than the sum of the parts. By combining
3502 several special purpose programs, you could accomplish a specific task
3503 that none of the programs was designed for, and accomplish it much more
3504 quickly and easily than if you had to write a special purpose program.
3505 We will see some (classic) examples of this further on in the column.
3506 (An important additional point was that, if necessary, you should take a
3507 detour and build any software tools you may need first, if you don't already
3508 have something appropriate in the toolbox.)
3510 @node I/O redirection
3511 @unnumberedsec I/O redirection
3513 Hopefully, you are familiar with the basics of I/O redirection in the
3514 shell, in particular the concepts of ``standard input,'' ``standard output,''
3515 and ``standard error''. Briefly, ``standard input'' is a data source, where
3516 data comes from. A program should not need to either know or care if the
3517 data source is a disk file, a keyboard, a magnetic tape, or even a punched
3518 card reader. Similarly, ``standard output'' is a data sink, where data goes
3519 to. The program should neither know nor care where this might be.
3520 Programs that only read their standard input, do something to the data,
3521 and then send it on, are called ``filters'', by analogy to filters in a
3524 With the Unix shell, it's very easy to set up data pipelines:
3527 program_to_create_data | filter1 | .... | filterN > final.pretty.data
3530 We start out by creating the raw data; each filter applies some successive
3531 transformation to the data, until by the time it comes out of the pipeline,
3532 it is in the desired form.
3534 This is fine and good for standard input and standard output. Where does the
3535 standard error come into play? Well, think about @code{filter1} in
3536 the pipeline above. What happens if it encounters an error in the data it
3537 sees? If it writes an error message to standard output, it will just
3538 disappear down the pipeline into @code{filter2}'s input, and the
3539 user will probably never see it. So programs need a place where they can send
3540 error messages so that the user will notice them. This is standard error,
3541 and it is usually connected to your console or window, even if you have
3542 redirected standard output of your program away from your screen.
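To make this concrete, here is a rough sketch of a filter that keeps its complaints out of the pipeline. The function name @code{only_numbers} and its message format are hypothetical, invented just for this illustration.

```shell
# Hypothetical filter: passes lines made only of digits to standard output
# and reports anything else on standard error.
only_numbers() {
    while IFS= read -r line; do
        case $line in
            '' | *[!0-9]*) printf 'only_numbers: bad input: %s\n' "$line" >&2 ;;
            *) printf '%s\n' "$line" ;;
        esac
    done
}

# The next stage sees only clean data; the complaint bypasses the pipe.
clean=$(printf '1\noops\n2\n' | only_numbers 2>/dev/null | sort -n)

# Capture standard error alone to show where the message went.
errors=$(printf '1\noops\n2\n' | only_numbers 2>&1 >/dev/null)

printf '%s\n' "$clean" "$errors"
```

Had the filter written its message to standard output instead, @code{sort} would have received it as just another line of data.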
3544 For filter programs to work together, the format of the data has to be
3545 agreed upon. The most straightforward and easiest format to use is simply
3546 lines of text. Unix data files are generally just streams of bytes, with
3547 lines delimited by the @sc{ASCII} @sc{LF} (Line Feed) character,
3548 conventionally called a ``newline'' in the Unix literature. (This is
3549 @code{'\n'} if you're a C programmer.) This is the format used by all
3550 the traditional filtering programs. (Many earlier operating systems
3551 had elaborate facilities and special purpose programs for managing
3552 binary data. Unix has always shied away from such things, under the
3553 philosophy that it's easiest to simply be able to view and edit your
3554 data with a text editor.)
3556 OK, enough introduction. Let's take a look at some of the tools, and then
3557 we'll see how to hook them together in interesting ways. In the following
3558 discussion, we will only present those command line options that interest
3559 us. As you should always do, double check your system documentation
3562 @node The who command
3563 @unnumberedsec The @code{who} command
3565 The first program is the @code{who} command. By itself, it generates a
3566 list of the users who are currently logged in. Although I'm writing
3567 this on a single-user system, we'll pretend that several people are
3572 arnold   console Jan 22 19:57
3573 miriam   ttyp0   Jan 23 14:19(:0.0)
3574 bill     ttyp1   Jan 21 09:32(:0.0)
3575 arnold   ttyp2   Jan 23 20:48(:0.0)
3578 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
3579 There are three people logged in, and I am logged in twice. On traditional
3580 Unix systems, user names are never more than eight characters long. This
3581 little bit of trivia will be useful later. The output of @code{who} is nice,
3582 but the data is not all that exciting.
3584 @node The cut command
3585 @unnumberedsec The @code{cut} command
3587 The next program we'll look at is the @code{cut} command. This program
3588 cuts out columns or fields of input data. For example, we can tell it
3589 to print just the login name and full name from the @file{/etc/passwd}
3590 file. The @file{/etc/passwd} file has seven fields, separated by
3594 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
3597 To get the first and fifth fields, we would use @code{cut} like this:
3600 $ cut -d: -f1,5 /etc/passwd
3603 arnold:Arnold D. Robbins
3604 miriam:Miriam A. Robbins
3608 With the @samp{-c} option, @code{cut} will cut out specific characters
3609 (i.e., columns) in the input lines. This command looks like it might be
3610 useful for data filtering.
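If you'd rather not poke at your real @file{/etc/passwd}, both modes can be tried on a single line held in a shell variable. This is just a sketch; the variable names are mine.

```shell
# One passwd-style record, taken from the sample shown earlier.
line='arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh'

# Field mode: colon-delimited fields 1 and 5.
by_field=$(printf '%s\n' "$line" | cut -d: -f1,5)

# Character mode: the first six character positions.
by_column=$(printf '%s\n' "$line" | cut -c1-6)

printf '%s\n%s\n' "$by_field" "$by_column"
```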
3613 @node The sort command
3614 @unnumberedsec The @code{sort} command
3616 Next we'll look at the @code{sort} command. This is one of the most
3617 powerful commands on a Unix-style system; one that you will often find
3618 yourself using when setting up fancy data plumbing. The @code{sort}
3619 command reads and sorts each file named on the command line. It then
3620 merges the sorted data and writes it to standard output. It will read
3621 standard input if no files are given on the command line (thus
3622 making it into a filter). The sort is based on the machine collating
3623 sequence (@sc{ASCII}) or based on user-supplied ordering criteria.
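Here is a small sketch of those behaviors. The file names and contents are made up, and I set @code{LC_ALL=C} as an assumption so that the collating sequence really is plain @sc{ASCII}.

```shell
# Assumption: force the C locale so ordering follows the ASCII sequence.
LC_ALL=C; export LC_ALL

tmp=$(mktemp -d)
printf 'pear\napple\n' > "$tmp/one"
printf 'fig\n' > "$tmp/two"

# Several files: each is sorted, and the results are merged.
merged=$(sort "$tmp/one" "$tmp/two")

# No files: sort acts as a filter.  Compare text order with numeric order.
as_text=$(printf '10\n9\n100\n' | sort)
as_numbers=$(printf '10\n9\n100\n' | sort -n)

printf '%s\n' "$merged" "$as_text" "$as_numbers"
rm -rf "$tmp"
```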
3626 @node The uniq command
3627 @unnumberedsec The @code{uniq} command
3629 Finally (at least for now), we'll look at the @code{uniq} program. When
3630 sorting data, you will often end up with duplicate lines, lines that
3631 are identical. Usually, all you need is one instance of each line.
3632 This is where @code{uniq} comes in. The @code{uniq} program reads its
3633 standard input, which it expects to be sorted. It only prints out one
3634 copy of each duplicated line. It does have several options. Later on,
3635 we'll use the @samp{-c} option, which prints each unique line, preceded
3636 by a count of the number of times that line occurred in the input.
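A quick sketch shows why the input needs to be sorted first. The three-line input is invented, and the @code{sed} step is my own addition: it only strips the leading padding that @samp{uniq -c} puts in front of each count, so the result is easy to compare.

```shell
# Unsorted: the two "b" lines are not adjacent, so nothing is removed.
unsorted=$(printf 'b\na\nb\n' | uniq)

# Sorted first: duplicates are adjacent and collapse to one copy.
unique=$(printf 'b\na\nb\n' | sort | uniq)

# -c prefixes each line with its count (padding removed by sed).
counted=$(printf 'b\na\nb\n' | sort | uniq -c | sed 's/^ *//')

printf '%s\n' "$unsorted" "$unique" "$counted"
```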
3639 @node Putting the tools together
3640 @unnumberedsec Putting the tools together
3642 Now, let's suppose this is a large BBS system with dozens of users
3643 logged in. The management wants the SysOp to write a program that will
3644 generate a sorted list of logged in users. Furthermore, even if a user
3645 is logged in multiple times, his or her name should only show up in the
3648 The SysOp could sit down with the system documentation and write a C
3649 program that did this. It would take perhaps a couple of hundred lines
3650 of code and about two hours to write it, test it, and debug it.
3651 However, knowing the software toolbox, the SysOp can instead start out
3652 by generating just a list of logged on users:
3662 Next, sort the list:
3665 $ who | cut -c1-8 | sort
3672 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
3675 $ who | cut -c1-8 | sort | uniq
3681 The @code{sort} command actually has a @samp{-u} option that does what
3682 @code{uniq} does. However, @code{uniq} has other uses for which one
3683 cannot substitute @samp{sort -u}.
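Since the real @code{who} output depends on who happens to be logged in, here is a sketch that replays the sample output from earlier through the same pipeline. The @code{sample_who} function is a stand-in I made up for this purpose.

```shell
# Stand-in for who(1): replays the sample session shown earlier.
sample_who() {
    cat <<'EOF'
arnold   console Jan 22 19:57
miriam   ttyp0   Jan 23 14:19(:0.0)
bill     ttyp1   Jan 21 09:32(:0.0)
arnold   ttyp2   Jan 23 20:48(:0.0)
EOF
}

# The pipeline from the text, built from the same three stages.
users=$(sample_who | cut -c1-8 | sort | uniq)

# For this particular job, sort -u gives the same list.
users_u=$(sample_who | cut -c1-8 | sort -u)

count=$(printf '%s\n' "$users" | wc -l | tr -d ' ')
printf '%s\n' "$users"
printf 'distinct users: %s\n' "$count"
```

Note that @samp{cut -c1-8} keeps the padding after short names, so each line of the result is eight characters wide.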
3685 The SysOp puts this pipeline into a shell script, and makes it available for
3686 all the users on the system:
3689 # cat > /usr/local/bin/listusers
3690 who | cut -c1-8 | sort | uniq
3692 # chmod +x /usr/local/bin/listusers
3695 There are four major points to note here. First, with just four
3696 programs, on one command line, the SysOp was able to save about two
3697 hours' worth of work. Furthermore, the shell pipeline is just about as
3698 efficient as the C program would be, and it is much more efficient in
3699 terms of programmer time. People time is much more expensive than
3700 computer time, and in our modern ``there's never enough time to do
3701 everything'' society, saving two hours of programmer time is no mean
3704 Second, it is also important to emphasize that with the
3705 @emph{combination} of the tools, it is possible to do a special
3706 purpose job never imagined by the authors of the individual programs.
3708 Third, it is also valuable to build up your pipeline in stages, as we did here.
3709 This allows you to view the data at each stage in the pipeline, which helps
3710 you acquire the confidence that you are indeed using these tools correctly.
3712 Finally, by bundling the pipeline in a shell script, other users can use
3713 your command, without having to remember the fancy plumbing you set up for
3714 them. In terms of how you run them, shell scripts and compiled programs are
3717 After the previous warm-up exercise, we'll look at two additional, more
3718 complicated pipelines. For them, we need to introduce two more tools.
3720 The first is the @code{tr} command, which stands for ``transliterate.''
3721 The @code{tr} command works on a character-by-character basis, changing
3722 characters. Normally it is used for things like mapping upper case to
3726 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
3727 this example has mixed case!
3730 There are several options of interest:
3734 work on the complement of the listed characters, i.e.,
3735 operations apply to characters not in the given set
3738 delete characters in the first set from the output
3741 squeeze repeated characters in the output into just one character.
3744 We will be using all three options in a moment.
3746 The other command we'll look at is @code{comm}. The @code{comm}
3747 command takes two sorted input files as input data, and prints out the
3748 files' lines in three columns. The output columns are the data lines
3749 unique to the first file, the data lines unique to the second file, and
3750 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
3751 @samp{-3} command line options omit the respective columns. (This is
3752 non-intuitive and takes a little getting used to.) For example:
3774 The single dash as a filename tells @code{comm} to read standard input
3775 instead of a regular file.
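Here is a small sketch of the column options, including the single dash. The two fruit lists are made up for the occasion.

```shell
tmp=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmp/one"
printf 'banana\ncherry\ndate\n' > "$tmp/two"

# Omit columns 1 and 2: what is left is the lines common to both files.
common=$(comm -12 "$tmp/one" "$tmp/two")

# Omit columns 2 and 3: lines found only in the first file.
only_first=$(comm -23 "$tmp/one" "$tmp/two")

# "-" makes comm read the first file from standard input.
only_second=$(sort "$tmp/one" | comm -13 - "$tmp/two")

printf '%s\n' "$common" "$only_first" "$only_second"
rm -rf "$tmp"
```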
3777 Now we're ready to build a fancy pipeline. The first application is a word
3778 frequency counter. This helps an author determine if he or she is over-using
3781 The first step is to change the case of all the letters in our input file
3782 to one case. ``The'' and ``the'' are the same word when doing counting.
3785 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
3788 The next step is to get rid of punctuation. Quoted words and unquoted words
3789 should be treated identically; it's easiest to just get the punctuation out of
3793 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
3796 The second @code{tr} command operates on the complement of the listed
3797 characters, which are all the letters, the digits, the underscore, and
3798 the blank. The @samp{\012} represents the newline character; it has to
3799 be left alone. (The ASCII TAB character should also be included for
3800 good measure in a production script.)
3802 At this point, we have data consisting of words separated by blank space.
3803 The words only contain alphanumeric characters (and the underscore). The
3804 next step is to break the data apart so that we have one word per line. This
3805 makes the counting operation much easier, as we will see shortly.
3808 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3809 > tr -s '[ ]' '\012' | ...
3812 This command turns blanks into newlines. The @samp{-s} option squeezes
3813 multiple newline characters in the output into just one. This helps us
3814 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
3815 This is what the shell prints when it notices you haven't finished
3816 typing in all of a command.)
3818 We now have data consisting of one word per line, no punctuation, all one
3819 case. We're ready to count each word:
3822 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3823 > tr -s '[ ]' '\012' | sort | uniq -c | ...
3826 At this point, the data might look something like this:
3839 The output is sorted by word, not by count! What we want is the most
3840 frequently used words first. Fortunately, this is easy to accomplish,
3841 with the help of two more @code{sort} options:
3845 do a numeric sort, not an ASCII one
3848 reverse the order of the sort
3851 The final pipeline looks like this:
3854 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3855 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
3864 Whew! That's a lot to digest. Yet, the same principles apply. With six
3865 commands, on two lines (really one long one split for convenience), we've
3866 created a program that does something interesting and useful, in much
3867 less time than we could have written a C program to do the same thing.
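If you don't have a @file{whats.gnu} file handy, the same pipeline can be tried on a one-line sample. This is only a sketch: the sample sentence is invented, and I've written the @code{tr} sets as plain ranges without the surrounding brackets, which is my assumption about a more portable spelling rather than what the column used.

```shell
# Assumption: the C locale, so that ranges and sorting are plain ASCII.
LC_ALL=C; export LC_ALL

text='The quick fox. The lazy dog! the end'

# Fold case, drop punctuation, one word per line, count, most frequent first.
freq=$(printf '%s\n' "$text" |
    tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \n' |
    tr -s ' ' '\n' | sort | uniq -c | sort -nr)

# The first line is the most frequent word (leading padding stripped).
top=$(printf '%s\n' "$freq" | head -n 1 | sed 's/^ *//')
distinct=$(printf '%s\n' "$freq" | wc -l | tr -d ' ')

printf '%s\n' "$freq"
```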
3869 A minor modification to the above pipeline can give us a simple spelling
3870 checker! To determine if you've spelled a word correctly, all you have to
3871 do is look it up in a dictionary. If it is not there, then chances are
3872 that your spelling is incorrect. So, we need a dictionary. If you
3873 have the Slackware Linux distribution, you have the file
3874 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
3877 Now, how to compare our file with the dictionary? As before, we generate
3878 a sorted list of words, one per line:
3881 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3882 > tr -s '[ ]' '\012' | sort -u | ...
3885 Now, all we need is a list of words that are @emph{not} in the
3886 dictionary. Here is where the @code{comm} command comes in.
3889 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
3890 > tr -s '[ ]' '\012' | sort -u |
3891 > comm -23 - /usr/lib/ispell/ispell.words
3894 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
3895 dictionary (the second file), and lines that are in both files. Lines
3896 only in the first file (standard input, our stream of words), are
3897 words that are not in the dictionary. These are likely candidates for
3898 spelling errors. This pipeline was the first cut at a production
3899 spelling checker on Unix.
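To see it work without a real dictionary, here is a sketch with a four-word stand-in. The tiny word list, the sample sentence, and the unbracketed @code{tr} sets are all my own assumptions for the sake of a self-contained example.

```shell
# Assumption: the C locale, so sort and comm agree on the ordering.
LC_ALL=C; export LC_ALL

tmp=$(mktemp -d)
# Hypothetical miniature dictionary, already sorted.
printf 'a\nis\ntest\nthis\n' > "$tmp/words"

# Same stages as before, ending in comm -23 against the dictionary.
suspects=$(printf 'This is a tset.\n' |
    tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \n' |
    tr -s ' ' '\n' | sort -u |
    comm -23 - "$tmp/words")

printf '%s\n' "$suspects"
rm -rf "$tmp"
```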
3901 There are some other tools that deserve brief mention.
3905 search files for text that matches a regular expression
3908 like @code{grep}, but with more powerful regular expressions
3911 count lines, words, characters
3914 a T-fitting for data pipes, copies data to files and to standard output
3917 the stream editor, an advanced tool
3920 a data manipulation language, another advanced tool
3923 The software tools philosophy also espoused the following bit of
3924 advice: ``Let someone else do the hard part.'' This means, take
3925 something that gives you most of what you need, and then massage it the
3926 rest of the way until it's in the form that you want.
3932 Each program should do one thing well. No more, no less.
3935 Combining programs with appropriate plumbing leads to results where
3936 the whole is greater than the sum of the parts. It also leads to novel
3937 uses of programs that the authors might never have imagined.
3940 Programs should never print extraneous header or trailer data, since these
3941 could get sent on down a pipeline. (A point we didn't mention earlier.)
3944 Let someone else do the hard part.
3947 Know your toolbox! Use each program appropriately. If you don't have an
3948 appropriate tool, build one.
3951 As of this writing, all the programs we've discussed are available via
3952 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
3953 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was
3954 current when this column was written. Check the nearest GNU archive for
3955 the current version.}

None of what I have presented in this column is new. The Software Tools
philosophy was first introduced in the book @cite{Software Tools},
by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
0-201-03669-X). This book showed how to write and use software
tools. It was written in 1976, using a preprocessor for FORTRAN named
@code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
as it is now; FORTRAN was. The last chapter presented a
@code{ratfor}-to-FORTRAN processor, written in @code{ratfor}.
@code{ratfor} looks an awful lot like C; if you know C, you won't have
any problem following the code.

In 1981, the book was updated and made available as @cite{Software
Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
remain in print, and are well worth reading if you're a programmer.
They certainly made a major change in how I view programming.

Initially, the programs in both books were available (on 9-track tape)
from Addison-Wesley. Unfortunately, this is no longer the case,
although you might be able to find copies floating around the Internet.
For a number of years, there was an active Software Tools Users Group,
whose members had ported the original @code{ratfor} programs to essentially
every computer system with a FORTRAN compiler. The popularity of the
group waned in the mid-1980s as Unix began to spread beyond universities.

With the current proliferation of GNU code and other clones of Unix programs,
these programs now receive little attention; modern C versions are
much more efficient and do more than these programs do. Nevertheless, as
exposition of good programming style, and evangelism for a still-valuable
philosophy, these books are unparalleled, and I recommend them highly.

Acknowledgment: I would like to express my gratitude to Brian Kernighan
of Bell Labs, the original Software Toolsmith, for reviewing this column.

@c texinfo-column-for-description: 32