3 @setfilename textutils.info
4 @settitle GNU text utilities
12 @c Put everything in one index (arbitrarily chosen to be the concept index).
22 * Text utilities: (textutils). GNU text utilities.
23 * cat: (textutils)cat invocation. Concatenate and write files.
24 * cksum: (textutils)cksum invocation. Print @sc{posix} CRC checksum.
25 * comm: (textutils)comm invocation. Compare sorted files by line.
26 * csplit: (textutils)csplit invocation. Split by context.
27 * cut: (textutils)cut invocation. Print selected parts of lines.
28 * expand: (textutils)expand invocation. Convert tabs to spaces.
29 * fmt: (textutils)fmt invocation. Reformat paragraph text.
30 * fold: (textutils)fold invocation. Wrap long input lines.
31 * head: (textutils)head invocation. Output the first part of files.
32 * join: (textutils)join invocation. Join lines on a common field.
33 * md5sum: (textutils)md5sum invocation. Print or check message-digests.
34 * nl: (textutils)nl invocation. Number lines and write files.
35 * od: (textutils)od invocation. Dump files in octal, etc.
36 * paste: (textutils)paste invocation. Merge lines of files.
37 * pr: (textutils)pr invocation. Paginate or columnate files.
38 * ptx: (textutils)ptx invocation. Produce permuted indexes.
39 * sort: (textutils)sort invocation. Sort text files.
40 * split: (textutils)split invocation. Split into fixed-size pieces.
41 * sum: (textutils)sum invocation. Print traditional checksum.
42 * tac: (textutils)tac invocation. Reverse files.
43 * tail: (textutils)tail invocation. Output the last part of files.
44 * tsort: (textutils)tsort invocation. Topological sort.
45 * tr: (textutils)tr invocation. Translate characters.
46 * unexpand: (textutils)unexpand invocation. Convert spaces to tabs.
47 * uniq: (textutils)uniq invocation. Uniquify files.
48 * wc: (textutils)wc invocation. Byte, word, and line counts.
54 This file documents the GNU text utilities.
56 Copyright (C) 1994, 95, 96 Free Software Foundation, Inc.
58 Permission is granted to make and distribute verbatim copies of
59 this manual provided the copyright notice and this permission notice
60 are preserved on all copies.
63 Permission is granted to process this file through TeX and print the
64 results, provided the printed document carries copying permission
65 notice identical to this one except for the removal of this paragraph
66 (this paragraph not being relevant to the printed manual).
69 Permission is granted to copy and distribute modified versions of this
70 manual under the conditions for verbatim copying, provided that the entire
71 resulting derived work is distributed under the terms of a permission
72 notice identical to this one.
74 Permission is granted to copy and distribute translations of this manual
75 into another language, under the above conditions for modified versions,
76 except that this permission notice may be stated in a translation approved
81 @title GNU @code{textutils}
82 @subtitle A set of text utilities
83 @subtitle for version @value{VERSION}, @value{UPDATED}
84 @author David MacKenzie et al.
87 @vskip 0pt plus 1filll
88 Copyright @copyright{} 1994, 95, 96 Free Software Foundation, Inc.
90 Permission is granted to make and distribute verbatim copies of
91 this manual provided the copyright notice and this permission notice
92 are preserved on all copies.
94 Permission is granted to copy and distribute modified versions of this
95 manual under the conditions for verbatim copying, provided that the entire
96 resulting derived work is distributed under the terms of a permission
97 notice identical to this one.
99 Permission is granted to copy and distribute translations of this manual
100 into another language, under the above conditions for modified versions,
101 except that this permission notice may be stated in a translation approved
106 @c If your makeinfo doesn't grok this @ifnottex directive, then either
107 @c get a newer version of makeinfo or do s/ifnottex/ifinfo/ here and on
108 @c the matching @end directive below.
111 @top GNU text utilities
113 @cindex text utilities
114 @cindex utilities for text handling
116 This manual documents version @value{VERSION} of the GNU text utilities.
119 * Introduction:: Caveats, overview, and authors.
120 * Common options:: Common options.
121 * Output of entire files:: cat tac nl od
122 * Formatting file contents:: fmt pr fold
123 * Output of parts of files:: head tail split csplit
124 * Summarizing files:: wc sum cksum md5sum
125 * Operating on sorted files:: sort uniq comm ptx tsort
126 * Operating on fields within a line:: cut paste join
127 * Operating on characters:: tr expand unexpand
128 * Opening the software toolbox:: The software tools philosophy.
129 * Index:: General index.
132 --- The Detailed Node Listing ---
134 Output of entire files
136 * cat invocation:: Concatenate and write files.
137 * tac invocation:: Concatenate and write files in reverse.
138 * nl invocation:: Number lines and write files.
139 * od invocation:: Write files in octal or other formats.
141 Formatting file contents
143 * fmt invocation:: Reformat paragraph text.
144 * pr invocation:: Paginate or columnate files for printing.
145 * fold invocation:: Wrap input lines to fit in specified width.
147 Output of parts of files
149 * head invocation:: Output the first part of files.
150 * tail invocation:: Output the last part of files.
151 * split invocation:: Split a file into fixed-size pieces.
152 * csplit invocation:: Split a file into context-determined pieces.
156 * wc invocation:: Print byte, word, and line counts.
157 * sum invocation:: Print checksum and block counts.
158 * cksum invocation:: Print CRC checksum and byte counts.
159 * md5sum invocation:: Print or check message-digests.
161 Operating on sorted files
163 * sort invocation:: Sort text files.
164 * uniq invocation:: Uniquify files.
165 * comm invocation:: Compare two sorted files line by line.
166 * ptx invocation:: Produce a permuted index of file contents.
167 * tsort invocation:: Topological sort.
169 @code{ptx}: Produce permuted indexes
171 * General options in ptx:: Options which affect general program behavior.
172 * Charset selection in ptx:: Underlying character set considerations.
173 * Input processing in ptx:: Input fields, contexts, and keyword selection.
174 * Output formatting in ptx:: Types of output format, and sizing the fields.
175 * Compatibility in ptx:: The GNU extensions to @code{ptx}
177 Operating on fields within a line
179 * cut invocation:: Print selected parts of lines.
180 * paste invocation:: Merge lines of files.
181 * join invocation:: Join lines on a common field.
183 Operating on characters
185 * tr invocation:: Translate, squeeze, and/or delete characters.
186 * expand invocation:: Convert tabs to spaces.
187 * unexpand invocation:: Convert spaces to tabs.
189 @code{tr}: Translate, squeeze, and/or delete characters
191 * Character sets:: Specifying sets of characters.
* Translating:: Changing one character to another.
193 * Squeezing:: Squeezing repeats and deleting.
194 * Warnings in tr:: Warning messages.
196 Opening the software toolbox
198 * Toolbox introduction:: Toolbox introduction
199 * I/O redirection:: I/O redirection
200 * The who command:: The @code{who} command
201 * The cut command:: The @code{cut} command
202 * The sort command:: The @code{sort} command
203 * The uniq command:: The @code{uniq} command
204 * Putting the tools together:: Putting the tools together
213 @chapter Introduction
This manual is incomplete: No attempt is made to explain basic concepts
in a way suitable for novices. Thus, if you are interested, please get
involved in improving this manual. The entire GNU community will
benefit.
223 The GNU text utilities are mostly compatible with the @sc{posix.2} standard.
225 @c This paragraph appears in all of fileutils.texi, textutils.texi, and
226 @c sh-utils.texi too -- so be sure to keep them consistent.
227 @cindex bugs, reporting
228 Please report bugs to @email{bug-textutils@@gnu.org}. Remember
229 to include the version number, machine architecture, input files, and
230 any other information needed to reproduce the bug: your input, what you
231 expected, what you got, and why it is wrong. Diffs are welcome, but
232 please include a description of the problem as well, since this is
233 sometimes difficult to infer. @xref{Bugs, , , gcc, GNU CC}.
235 This manual was originally derived from the Unix man pages in the
236 distribution, which were written by David MacKenzie and updated by Jim
237 Meyering. What you are reading now is the authoritative documentation
238 for these utilities; the man pages are no longer being maintained.
239 The original @code{fmt} man page was written by Ross Paterson.
240 Fran@,{c}ois Pinard did the initial conversion to Texinfo format.
241 Karl Berry did the indexing, some reorganization, and editing of the results.
242 Richard Stallman contributed his usual invaluable insights to the
247 @chapter Common options
249 @cindex common options
Certain options are available in all these programs. Rather than
repeating identical descriptions for each program, we describe them
here. (In fact, every GNU program accepts (or should accept)
these options.)
256 A few of these programs take arbitrary strings as arguments. In those
257 cases, @samp{--help} and @samp{--version} are taken as these options
258 only if there is one and exactly one command line argument.
265 Print a usage message listing all available options, then exit successfully.
269 @cindex version number, finding
270 Print the version number, then exit successfully.
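As a small illustration, the following sketch checks that both options
exit successfully. It assumes a POSIX shell and the GNU version of
@code{cat}; any other program described in this manual could be
substituted.

```shell
# Both options print their message to standard output and exit with status 0.
cat --version > /dev/null
version_status=$?
cat --help > /dev/null
help_status=$?
echo "version: $version_status, help: $help_status"
```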
275 @node Output of entire files
276 @chapter Output of entire files
278 @cindex output of entire files
279 @cindex entire files, output of
281 These commands read and write entire files, possibly transforming them
285 * cat invocation:: Concatenate and write files.
286 * tac invocation:: Concatenate and write files in reverse.
287 * nl invocation:: Number lines and write files.
288 * od invocation:: Write files in octal or other formats.
292 @section @code{cat}: Concatenate and write files
295 @cindex concatenate and write files
296 @cindex copying files
298 @code{cat} copies each @var{file} (@samp{-} means standard input), or
299 standard input if none are given, to standard output. Synopsis:
302 cat [@var{option}] [@var{file}]@dots{}
305 The program accepts the following options. Also see @ref{Common options}.
313 Equivalent to @samp{-vET}.
319 @cindex binary and text I/O in cat
320 On MS-DOS and MS-Windows only, read and write the
321 files in binary mode. By default, @code{cat} on MS-DOS/MS-Windows uses
322 binary mode only when standard output is redirected to a file or a pipe;
323 this option overrides that. Binary file I/O is used so that the files
324 retain their format (Unix text as opposed to DOS text and binary),
325 because @code{cat} is frequently used as a file-copying program. Some
options (see below) cause @code{cat} to read and write files in text mode
327 because then the original file contents aren't important (e.g., when
328 lines are numbered by @code{cat}, or when line endings should be
329 marked). This is so these options work as DOS/Windows users would
330 expect; for example, DOS-style text files have their lines end with
331 the CR-LF pair of characters which won't be processed as an empty line
332 by @samp{-b} unless the file is read in text mode.
335 @itemx --number-nonblank
337 @opindex --number-nonblank
Number all nonblank output lines, starting with 1. On MS-DOS and
MS-Windows, this option causes @code{cat} to read and write files in
text mode.
344 Equivalent to @samp{-vE}.
Display a @samp{$} after the end of each line. On MS-DOS and
MS-Windows, this option causes @code{cat} to read and write files in
text mode.
358 Number all output lines, starting with 1. On MS-DOS and MS-Windows,
359 this option causes @code{cat} to read and write files in text mode.
362 @itemx --squeeze-blank
364 @opindex --squeeze-blank
365 @cindex squeezing blank lines
Replace multiple adjacent blank lines with a single blank line. On
MS-DOS and MS-Windows, this option causes @code{cat} to read and write
files in text mode.
372 Equivalent to @samp{-vT}.
378 Display TAB characters as @samp{^I}.
382 Ignored; for Unix compatibility.
385 @itemx --show-nonprinting
387 @opindex --show-nonprinting
388 Display control characters except for LFD and TAB using
389 @samp{^} notation and precede characters that have the high bit set with
390 @samp{M-}. On MS-DOS and MS-Windows, this option causes @code{cat} to
391 read files and standard input in DOS binary mode, so the CR
392 characters at the end of each line are also visible.
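The following sketch shows two of the options above at work. It
assumes a POSIX shell and GNU @code{cat}; the sample input is made up
for the illustration.

```shell
# Make tabs and line ends visible: -A is shorthand for -vET.
visible=$(printf 'a\tb\n' | cat -A)
echo "$visible"      # prints: a^Ib$

# Collapse a run of blank lines into a single blank line with -s.
squeezed=$(printf 'x\n\n\n\ny\n' | cat -s)
echo "$squeezed"     # prints x, one empty line, then y
```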
398 @section @code{tac}: Concatenate and write files in reverse
401 @cindex reversing files
403 @code{tac} copies each @var{file} (@samp{-} means standard input), or
404 standard input if none are given, to standard output, reversing the
405 records (lines by default) in each separately. Synopsis:
408 tac [@var{option}]@dots{} [@var{file}]@dots{}
411 @dfn{Records} are separated by instances of a string (newline by
412 default). By default, this separator string is attached to the end of
413 the record that it follows in the file.
415 The program accepts the following options. Also see @ref{Common options}.
423 The separator is attached to the beginning of the record that it
424 precedes in the file.
430 Treat the separator string as a regular expression. Users of @code{tac}
431 on MS-DOS/MS-Windows should note that, since @code{tac} reads files in
432 binary mode, each line of a text file might end with a CR/LF pair
433 instead of the Unix-style LF.
435 @item -s @var{separator}
436 @itemx --separator=@var{separator}
439 Use @var{separator} as the record separator, instead of newline.
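A short sketch of the default behavior and of the @samp{-s} option,
assuming a POSIX shell and GNU @code{tac}; the input strings are
invented for the example.

```shell
# Reverse the order of lines (the default record separator is newline).
reversed=$(printf 'one\ntwo\nthree\n' | tac)
echo "$reversed"     # prints three, two, one on separate lines

# Use a comma as the record separator; it stays attached to the end
# of the record that it follows.
by_comma=$(printf 'a,b,c,' | tac -s ,)
echo "$by_comma"     # prints: c,b,a,
```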
445 @section @code{nl}: Number lines and write files
448 @cindex numbering lines
449 @cindex line numbering
451 @code{nl} writes each @var{file} (@samp{-} means standard input), or
452 standard input if none are given, to standard output, with line numbers
453 added to some or all of the lines. Synopsis:
456 nl [@var{option}]@dots{} [@var{file}]@dots{}
459 @cindex logical pages, numbering on
460 @code{nl} decomposes its input into (logical) pages; by default, the
461 line number is reset to 1 at the top of each logical page. @code{nl}
462 treats all of the input files as a single document; it does not reset
463 line numbers or logical pages between files.
465 @cindex headers, numbering
466 @cindex body, numbering
467 @cindex footers, numbering
468 A logical page consists of three sections: header, body, and footer.
469 Any of the sections can be empty. Each can be numbered in a different
470 style from the others.
472 The beginnings of the sections of logical pages are indicated in the
473 input file by a line containing exactly one of these delimiter strings:
484 The two characters from which these strings are made can be changed from
485 @samp{\} and @samp{:} via options (see below), but the pattern and
486 length of each string cannot be changed.
488 A section delimiter is replaced by an empty line on output. Any text
489 that comes before the first section delimiter string in the input file
490 is considered to be part of a body section, so @code{nl} treats a
491 file that contains no section delimiters as a single body section.
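The effect of logical pages can be sketched as follows, assuming a
POSIX shell and GNU @code{nl}. The variable name @code{sample} and its
contents are made up for the illustration; the @samp{-p} option used in
the second command is described with the other options.

```shell
# Two body lines, then a header delimiter (\:\:\:) that starts a new
# logical page, a body delimiter (\:\:), and one more body line.
sample='a\nb\n\\:\\:\\:\n\\:\\:\nc\n'

# By default the line number restarts at 1 on the new logical page,
# and each delimiter line becomes an empty line in the output.
reset=$(printf "$sample" | nl)
echo "$reset"

# With -p the numbering continues across logical pages.
continued=$(printf "$sample" | nl -p)
echo "$continued"
```

In the first output the lines are numbered 1, 2, 1; in the second they
are numbered 1, 2, 3.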
493 The program accepts the following options. Also see @ref{Common options}.
498 @itemx --body-numbering=@var{style}
500 @opindex --body-numbering
501 Select the numbering style for lines in the body section of each
502 logical page. When a line is not numbered, the current line number
503 is not incremented, but the line number separator character is still
504 prepended to the line. The styles are:
510 number only nonempty lines (default for body),
512 do not number lines (default for header and footer),
514 number only lines that contain a match for @var{regexp}.
518 @itemx --section-delimiter=@var{cd}
520 @opindex --section-delimiter
521 @cindex section delimiters of pages
522 Set the section delimiter characters to @var{cd}; default is
523 @samp{\:}. If only @var{c} is given, the second remains @samp{:}.
524 (Remember to protect @samp{\} or other metacharacters from shell
525 expansion with quotes or extra backslashes.)
528 @itemx --footer-numbering=@var{style}
530 @opindex --footer-numbering
531 Analogous to @samp{--body-numbering}.
534 @itemx --header-numbering=@var{style}
536 @opindex --header-numbering
537 Analogous to @samp{--body-numbering}.
539 @item -i @var{number}
540 @itemx --page-increment=@var{number}
542 @opindex --page-increment
543 Increment line numbers by @var{number} (default 1).
545 @item -l @var{number}
546 @itemx --join-blank-lines=@var{number}
548 @opindex --join-blank-lines
549 @cindex empty lines, numbering
550 @cindex blank lines, numbering
551 Consider @var{number} (default 1) consecutive empty lines to be one
552 logical line for numbering, and only number the last one. Where fewer
553 than @var{number} consecutive empty lines occur, do not number them.
554 An empty line is one that contains no characters, not even spaces
557 @item -n @var{format}
558 @itemx --number-format=@var{format}
560 @opindex --number-format
561 Select the line numbering format (default is @code{rn}):
565 @opindex ln @r{format for @code{nl}}
566 left justified, no leading zeros;
568 @opindex rn @r{format for @code{nl}}
569 right justified, no leading zeros;
571 @opindex rz @r{format for @code{nl}}
572 right justified, leading zeros.
578 @opindex --no-renumber
579 Do not reset the line number at the start of a logical page.
581 @item -s @var{string}
582 @itemx --number-separator=@var{string}
584 @opindex --number-separator
585 Separate the line number from the text line in the output with
586 @var{string} (default is the TAB character).
588 @item -v @var{number}
589 @itemx --starting-line-number=@var{number}
591 @opindex --starting-line-number
592 Set the initial line number on each logical page to @var{number} (default 1).
594 @item -w @var{number}
595 @itemx --number-width=@var{number}
597 @opindex --number-width
598 Use @var{number} characters for line numbers (default 6).
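The following sketch combines several of these options. It assumes a
POSIX shell and GNU @code{nl}; the input lines are invented for the
example.

```shell
# Number every line (-b a) with zero-padded, 3-character numbers
# (-n rz -w 3) and ": " between the number and the text (-s).
padded=$(printf 'alpha\nbeta\n' | nl -b a -n rz -w 3 -s ': ')
echo "$padded"       # prints: 001: alpha / 002: beta

# Start at 10 (-v) and count in steps of 10 (-i).
stepped=$(printf 'alpha\nbeta\n' | nl -b a -v 10 -i 10 -w 2 -s ' ')
echo "$stepped"      # prints: 10 alpha / 20 beta
```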
604 @section @code{od}: Write files in octal or other formats
607 @cindex octal dump of files
608 @cindex hex dump of files
609 @cindex ASCII dump of files
610 @cindex file contents, dumping unambiguously
612 @code{od} writes an unambiguous representation of each @var{file}
613 (@samp{-} means standard input), or standard input if none are given.
617 od [@var{option}]@dots{} [@var{file}]@dots{}
618 od -C [@var{file}] [[+]@var{offset} [[+]@var{label}]]
621 Each line of output consists of the offset in the input, followed by
622 groups of data from the file. By default, @code{od} prints the offset in
623 octal, and each group of file data is two bytes of input printed as a
626 The program accepts the following options. Also see @ref{Common options}.
631 @itemx --address-radix=@var{radix}
633 @opindex --address-radix
634 @cindex radix for file offsets
635 @cindex file offset radix
636 Select the base in which file offsets are printed. @var{radix} can
637 be one of the following:
647 none (do not print offsets).
650 The default is octal.
653 @itemx --skip-bytes=@var{bytes}
655 @opindex --skip-bytes
656 Skip @var{bytes} input bytes before formatting and writing. If
657 @var{bytes} begins with @samp{0x} or @samp{0X}, it is interpreted in
658 hexadecimal; otherwise, if it begins with @samp{0}, in octal; otherwise,
659 in decimal. Appending @samp{b} multiplies @var{bytes} by 512, @samp{k}
660 by 1024, and @samp{m} by 1048576.
663 @itemx --read-bytes=@var{bytes}
665 @opindex --read-bytes
Output at most @var{bytes} bytes of the input. Prefixes and suffixes on
@var{bytes} are interpreted as for the @samp{-j} option.
670 @itemx --strings[=@var{n}]
673 @cindex string constants, outputting
674 Instead of the normal output, output only @dfn{string constants}: at
675 least @var{n} (3 by default) consecutive @sc{ascii} graphic characters,
676 followed by a null (zero) byte.
679 @itemx --format=@var{type}
682 Select the format in which to output the file data. @var{type} is a
683 string of one or more of the below type indicator characters. If you
684 include more than one type indicator character in a single @var{type}
685 string, or use this option more than once, @code{od} writes one copy
686 of each output line using each of the data types that you specified,
687 in the order that you specified.
689 Adding a trailing ``z'' to any type specification appends a display
690 of the @sc{ascii} character representation of the printable characters
691 to the output line generated by the type specification.
697 @sc{ascii} character or backslash escape,
710 The type @code{a} outputs things like @samp{sp} for space, @samp{nl} for
711 newline, and @samp{nul} for a null (zero) byte. Type @code{c} outputs
712 @samp{ }, @samp{\n}, and @code{\0}, respectively.
715 Except for types @samp{a} and @samp{c}, you can specify the number
716 of bytes to use in interpreting each number in the given data type
717 by following the type indicator character with a decimal integer.
718 Alternately, you can specify the size of one of the C compiler's
719 built-in data types by following the type indicator character with
720 one of the following characters. For integers (@samp{d}, @samp{o},
734 For floating point (@code{f}):
746 @itemx --output-duplicates
748 @opindex --output-duplicates
749 Output consecutive lines that are identical. By default, when two or
750 more consecutive output lines would be identical, @code{od} outputs only
751 the first line, and puts just an asterisk on the following line to
752 indicate the elision.
755 @itemx --width[=@var{n}]
Dump @var{n} input bytes per output line. This must be a multiple of
the least common multiple of the sizes associated with the specified
output types. If @var{n} is omitted, the default is 32. If this option
is not given at all, the default is 16.
765 The next several options map the old, pre-@sc{posix} format specification
766 options to the corresponding @sc{posix} format specs. GNU @code{od} accepts
767 any combination of old- and new-style options. Format specification
774 Output as named characters. Equivalent to @samp{-ta}.
778 Output as octal bytes. Equivalent to @samp{-toC}.
782 Output as @sc{ascii} characters or backslash escapes. Equivalent to
787 Output as unsigned decimal shorts. Equivalent to @samp{-tu2}.
791 Output as floats. Equivalent to @samp{-tfF}.
795 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
799 Output as decimal shorts. Equivalent to @samp{-td2}.
803 Output as decimal longs. Equivalent to @samp{-td4}.
807 Output as octal shorts. Equivalent to @samp{-to2}.
811 Output as hexadecimal shorts. Equivalent to @samp{-tx2}.
815 @opindex --traditional
Recognize the pre-@sc{posix} non-option arguments that traditional @code{od}
817 accepted. The following syntax:
820 od --traditional [@var{file}] [[+]@var{offset}[.][b] [[+]@var{label}[.][b]]]
824 can be used to specify at most one file and optional arguments
825 specifying an offset and a pseudo-start address, @var{label}. By
826 default, @var{offset} is interpreted as an octal number specifying how
827 many input bytes to skip before formatting and writing. The optional
828 trailing decimal point forces the interpretation of @var{offset} as a
decimal number. If no decimal point is specified and the offset begins with
@samp{0x} or @samp{0X}, it is interpreted as a hexadecimal number. If
831 there is a trailing @samp{b}, the number of bytes skipped will be
832 @var{offset} multiplied by 512. The @var{label} argument is interpreted
just like @var{offset}, but it specifies an initial pseudo-address. The
pseudo-addresses are displayed in parentheses following any normal
address.
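A few of the options described in this section can be sketched as
follows, assuming a POSIX shell and GNU @code{od}. The sample bytes are
made up for the illustration, and the exact spacing between output
fields is not shown because it depends on the selected type.

```shell
# Hexadecimal bytes without offsets: -A n suppresses the offset column
# and -t x1 selects one-byte hexadecimal units.
hex=$(printf 'AB' | od -A n -t x1)
echo "$hex"          # fields: 41 42

# Skip one input byte (-j 1), then dump at most one byte (-N 1).
middle=$(printf 'ABC' | od -A n -t x1 -j 1 -N 1)
echo "$middle"       # field: 42

# Named characters (-t a) show the space and the newline explicitly.
named=$(printf 'a b\n' | od -A n -t a)
echo "$named"        # fields: a sp b nl

# 48 identical bytes fill three output lines of 16 bytes each.  By
# default the repeats are elided with an asterisk; -v prints every line.
elided=$(printf '%048d' 0 | od -A n -t x1)
full=$(printf '%048d' 0 | od -A n -t x1 -v)
echo "$elided"
echo "$full"
```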
840 @node Formatting file contents
841 @chapter Formatting file contents
843 @cindex formatting file contents
845 These commands reformat the contents of files.
848 * fmt invocation:: Reformat paragraph text.
849 * pr invocation:: Paginate or columnate files for printing.
850 * fold invocation:: Wrap input lines to fit in specified width.
855 @section @code{fmt}: Reformat paragraph text
858 @cindex reformatting paragraph text
859 @cindex paragraphs, reformatting
860 @cindex text, reformatting
862 @code{fmt} fills and joins lines to produce output lines of (at most)
863 a given number of characters (75 by default). Synopsis:
866 fmt [@var{option}]@dots{} [@var{file}]@dots{}
869 @code{fmt} reads from the specified @var{file} arguments (or standard
870 input if none are given), and writes to standard output.
872 By default, blank lines, spaces between words, and indentation are
873 preserved in the output; successive input lines with different
874 indentation are not joined; tabs are expanded on input and introduced on
877 @cindex line-breaking
878 @cindex sentences and line-breaking
879 @cindex Knuth, Donald E.
880 @cindex Plass, Michael F.
881 @code{fmt} prefers breaking lines at the end of a sentence, and tries to
882 avoid line breaks after the first word of a sentence or before the last
883 word of a sentence. A @dfn{sentence break} is defined as either the end
884 of a paragraph or a word ending in any of @samp{.?!}, followed by two
885 spaces or end of line, ignoring any intervening parentheses or quotes.
886 Like @TeX{}, @code{fmt} reads entire ``paragraphs'' before choosing line
887 breaks; the algorithm is a variant of that in ``Breaking Paragraphs Into
888 Lines'' (Donald E. Knuth and Michael F. Plass, @cite{Software---Practice
889 and Experience}, 11 (1981), 1119--1184).
891 The program accepts the following options. Also see @ref{Common options}.
896 @itemx --crown-margin
898 @opindex --crown-margin
900 @dfn{Crown margin} mode: preserve the indentation of the first two
901 lines within a paragraph, and align the left margin of each subsequent
902 line with that of the second line.
905 @itemx --tagged-paragraph
907 @opindex --tagged-paragraph
908 @cindex tagged paragraphs
909 @dfn{Tagged paragraph} mode: like crown margin mode, except that if
910 indentation of the first line of a paragraph is the same as the
911 indentation of the second, the first line is treated as a one-line
917 @opindex --split-only
918 Split lines only. Do not join short lines to form longer ones. This
919 prevents sample lines of code, and other such ``formatted'' text from
920 being unduly combined.
923 @itemx --uniform-spacing
925 @opindex --uniform-spacing
926 Uniform spacing. Reduce spacing between words to one space, and spacing
927 between sentences to two spaces.
930 @itemx -w @var{width}
931 @itemx --width=@var{width}
932 @opindex -@var{width}
935 Fill output lines up to @var{width} characters (default 75). @code{fmt}
936 initially tries to make lines about 7% shorter than this, to give it
937 room to balance line lengths.
939 @item -p @var{prefix}
940 @itemx --prefix=@var{prefix}
941 Only lines beginning with @var{prefix} (possibly preceded by whitespace)
942 are subject to formatting. The prefix and any preceding whitespace are
943 stripped for the formatting and then re-attached to each formatted output
944 line. One use is to format certain kinds of program comments, while
945 leaving the code unchanged.
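The following sketch illustrates the @samp{-w} and @samp{-u} options.
It assumes a POSIX shell and GNU @code{fmt}; the variable name
@code{text} and the sample sentence are invented for the example. The
exact line breaks chosen by @code{fmt} are not shown, since they depend
on its line-balancing algorithm.

```shell
text='The quick brown fox jumps over the lazy dog and keeps running through the field.'

# Refill the text so that no output line is longer than 30 characters.
filled=$(printf '%s\n' "$text" | fmt -w 30)
echo "$filled"

# Uniform spacing: runs of spaces between words shrink to one space.
uniform=$(printf 'one    two   three\n' | fmt -u)
echo "$uniform"      # prints: one two three
```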
951 @section @code{pr}: Paginate or columnate files for printing
954 @cindex printing, preparing files for
955 @cindex multicolumn output, generating
956 @cindex merging files in parallel
958 @code{pr} writes each @var{file} (@samp{-} means standard input), or
959 standard input if none are given, to standard output, paginating and
960 optionally outputting in multicolumn format; optionally merges all
961 @var{file}s, printing all in parallel, one per column. Synopsis:
964 pr [@var{option}]@dots{} [@var{file}]@dots{}
By default, a 5-line header is printed on each page: two blank lines;
a line with the date, the filename, and the page count; and two more
blank lines. A footer of five blank lines is also printed. With the @samp{-F}
option, a 3-line header is printed: the leading two blank lines are
omitted; no footer is used. The default @var{page_length} in both cases is 66
lines. The default number of text lines changes from 56 (without @samp{-F})
to 63 (with @samp{-F}). The text line of the header takes up the full
@var{page_width} in the form @samp{yyyy-mm-dd HH:MM string Page nnnn},
where @samp{string} is a centered header string.
977 Form feeds in the input cause page breaks in the output. Multiple form
978 feeds produce empty pages.
Columns are of equal width, separated by an optional string (default
is @samp{space}). For multicolumn output, lines will always be truncated to
@var{page_width} (default 72), unless you use the @samp{-J} option. For
single-column output, no line truncation occurs by default. Use the
@samp{-W} option to truncate lines in that case.
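The two column-filling orders can be sketched as follows, assuming a
POSIX shell and GNU @code{pr}. The @samp{-t} option, which omits the
page header and footer, is used here only to keep the output short; the
spacing between columns is not shown because it depends on
@var{page_width}.

```shell
# Four input lines in two columns, filled down the page (the default).
down=$(printf '1\n2\n3\n4\n' | pr -2 -t)
echo "$down"         # rows: 1 3, then 2 4

# The same input filled across the page with -a.
across=$(printf '1\n2\n3\n4\n' | pr -2 -a -t)
echo "$across"       # rows: 1 2, then 3 4
```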
986 Including version 1.22i:
Some small letter options (@samp{-s}, @samp{-w}) have been redefined
for better @sc{posix} compliance. The output in some further cases
has been adapted to that of other Unix systems. These changes break
backward compatibility, which has to be accepted.
Some new capital letter options (@samp{-J}, @samp{-S}, @samp{-W})
have been introduced to turn off unexpected interference from small letter
options. The @samp{-N} option and the second argument @var{last_page}
of @samp{+FIRST_PAGE} offer more flexibility. The detailed handling of
form feeds set in the input files requires the @samp{-T} option.
999 Capital letter options dominate small letter ones.
Some of the option-arguments (compare @samp{-s}, @samp{-S}, @samp{-e},
@samp{-i}, @samp{-n}) cannot be specified as arguments separate from the
preceding option letter (as already stated in the @sc{posix} specification).
1005 The program accepts the following options. Also see @ref{Common options}.
1009 @item +@var{first_page}[:@var{last_page}]
1010 @itemx --pages=@var{first_page}[:@var{last_page}]
1011 @opindex +@var{first_page}[:@var{last_page}]
Begin printing with page @var{first_page} and stop with @var{last_page}.
A missing @samp{:@var{last_page}} implies end of file. While estimating
the number of skipped pages, each form feed in the input file results
in a new page. Page counting with and without @samp{+@var{first_page}}
is identical. By default, counting starts with the first page of the input
file (not the first page printed). Line numbering may be altered by the
@samp{-N} option.
1022 @itemx --columns=@var{column}
1023 @opindex -@var{column}
1025 @cindex down columns
With each single @var{file}, produce @var{column} columns of output
(default is 1) and print columns down, unless @samp{-a} is used. The
column width is automatically decreased as @var{column} increases, unless
you use the @samp{-W/-w} option to increase @var{page_width} as well.
This option might well cause some lines to be truncated. The number of
lines in the columns on each page is balanced. The options @samp{-e}
and @samp{-i} are on for multiple text-column output. Together with the
@samp{-J} option, column alignment and line truncation are turned off.
Lines of full length are joined in a free field format, and the @samp{-S}
option may set field separators. @samp{-@var{column}} may not be used
with the @samp{-m} option.
1042 @cindex across columns
1043 With each single @var{file}, print columns across rather than down. The
1044 @samp{-@var{column}} option must be given with @var{column} greater than one.
1045 If a line is too long to fit in a column, it is truncated.
1048 @itemx --show-control-chars
1050 @opindex --show-control-chars
1051 Print control characters using hat notation (e.g., @samp{^G}); print
1052 other unprintable characters in octal backslash notation. By default,
1053 unprintable characters are not changed.
1056 @itemx --double-space
1058 @opindex --double-space
1059 @cindex double spacing
1060 Double space the output.
1062 @item -e[@var{in-tabchar}[@var{in-tabwidth}]]
1063 @itemx --expand-tabs[=@var{in-tabchar}[@var{in-tabwidth}]]
1065 @opindex --expand-tabs
1067 Expand @var{tab}s to spaces on input. Optional argument @var{in-tabchar} is
1068 the input tab character (default is the TAB character). Second optional
1069 argument @var{in-tabwidth} is the input tab character's width (default
1077 @opindex --form-feed
Use a form feed instead of newlines to separate output pages. The default
page length of 66 lines is not altered, but the number of lines of text
per page changes from the default 56 to 63 lines.
1082 @item -h @var{HEADER}
1083 @itemx --header=@var{HEADER}
1086 Replace the filename in the header with the centered string @var{header}.
1087 Left-hand-side truncation (marked by a @samp{*}) may occur if the total
1088 header line @samp{yyyy-mm-dd HH:MM HEADER Page nnnn} becomes larger than
1089 @var{page_width}. @samp{-h ""} prints a blank line header. Don't use
1091 A space between the @samp{-h} option and the argument is always
1094 @item -i[@var{out-tabchar}[@var{out-tabwidth}]]
1095 @itemx --output-tabs[=@var{out-tabchar}[@var{out-tabwidth}]]
1097 @opindex --output-tabs
1099 Replace spaces with @var{tab}s on output. Optional argument @var{out-tabchar}
1100 is the output tab character (default is the TAB character). Second optional
1101 argument @var{out-tabwidth} is the output tab character's width (default
1107 @opindex --join-lines
Merge lines of full length. Used together with the column options
@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}. This option
turns off @samp{-W/-w} line truncation and uses no column alignment; it
may be used with @samp{-S[@var{string}]}.
@samp{-J} has been introduced (together with @samp{-W} and @samp{-S})
to disentangle the old (@sc{posix}-compliant) options @samp{-w} and
@samp{-s} along with the three column options.
1117 @item -l @var{page_length}
1118 @itemx --length=@var{page_length}
Set the page length to @var{page_length} (default 66) lines, including
the lines of the header [and the footer]. If @var{page_length} is less
than or equal to 10 (or less than or equal to 3 with @samp{-F}), the
header and footer are omitted, and all form feeds set in input files are
eliminated, as if the @samp{-T} option had been given.
1131 Merge and print all @var{file}s in parallel, one in each column. If a
1132 line is too long to fit in a column, it is truncated, unless @samp{-J}
1133 option is used. @samp{-S[@var{string}]} may be used. Empty pages in
1134 some @var{file}s (form feeds set) produce empty columns, still marked
1135 by @var{string}. The result is a continuous line numbering and column
1136 marking throughout the whole merged file. Completely empty merged pages
1137 show no separators or line numbers. The default header becomes
1138 @samp{yyyy-mm-dd HH:MM <blanks> Page nnnn}; may be used with
1139 @samp{-h @var{header}} to fill up the middle blank part.
1141 @item -n[@var{number-separator}[@var{digits}]]
1142 @itemx --number-lines[=@var{number-separator}[@var{digits}]]
1144 @opindex --number-lines
1145 Provide @var{digits} digit line numbering (default for @var{digits} is
1146 5). With multicolumn output the number occupies the first @var{digits}
1147 column positions of each text column or only each line of @samp{-m}
output. With single column output the number precedes each line just as
@samp{-m} does. By default, line counting starts with the first
line of the input file (not the first line printed; compare the
@samp{--pages} option and the @samp{-N} option).
The optional argument @var{number-separator} is the character appended to
the line number to separate it from the text that follows. The default
separator is the TAB character. In a strict sense a TAB is always
printed with single column output only. The @var{TAB}-width varies
with the @var{TAB}-position, e.g. with the left @var{margin} specified
by the @samp{-o} option. With multicolumn output priority is given to
@samp{equal width of output columns} (a @sc{posix} specification).
1159 The @var{TAB}-width is fixed to the value of the 1st column and does
1160 not change with different values of left @var{margin}. That means a
1161 fixed number of spaces is always printed in the place of the
1162 @var{number-separator tab}. The tabification depends upon the output
1165 @item -N @var{line_number}
1166 @itemx --first-line-number=@var{line_number}
1168 @opindex --first-line-number
Start line counting with the number @var{line_number} at the first line of
the first page printed (in most cases not the first line of the input file).
1172 @item -o @var{margin}
1173 @itemx --indent=@var{margin}
1176 @cindex indenting lines
1178 Indent each line with a margin @var{margin} spaces wide (default is zero).
1179 The total page width is the size of the margin plus the @var{page_width}
1180 set with the @samp{-W/-w} option. A limited overflow may occur with
1181 numbered single column output (compare @samp{-n} option).
1184 @itemx --no-file-warnings
1186 @opindex --no-file-warnings
1187 Do not print a warning message when an argument @var{file} cannot be
1188 opened. (The exit status will still be nonzero, however.)
1190 @item -s[@var{char}]
1191 @itemx --separator[=@var{char}]
1193 @opindex --separator
Separate columns by a single character @var{char}. The default for
@var{char} is the TAB character without @samp{-w} and @samp{no character}
with @samp{-w}. Without @samp{-s}, the default separator @samp{space} is
used. @samp{-s[char]} turns off line truncation for all three column
options (@samp{-COLUMN}|@samp{-a -COLUMN}|@samp{-m}) unless @samp{-w} is
set. That is a @sc{posix}-compliant formulation.
1202 @item -S[@var{string}]
1203 @itemx --sep-string[=@var{string}]
1205 @opindex --sep-string
1206 Use @var{string} to separate output columns. The @samp{-S} option doesn't
1207 affect the @samp{-W/-w} option, unlike the @samp{-s} option which does. It
1208 does not affect line truncation or column alignment.
1209 Without @samp{-S}, and with @samp{-J}, @code{pr} uses the default output
1211 Without @samp{-S} or @samp{-J}, @code{pr} uses a @samp{space}
1212 (same as @samp{-S" "}).
1213 Using @samp{-S} with no @var{string} is equivalent to @samp{-S""}.
1214 Note that for some of @code{pr}'s options the single-letter option
1215 character must be followed immediately by any corresponding argument;
1216 there may not be any intervening white space.
1217 @samp{-S/-s} is one of them. Don't use @samp{-S "STRING"}.
1218 @sc{posix} requires this.
1221 @itemx --omit-header
1223 @opindex --omit-header
1224 Do not print the usual header [and footer] on each page, and do not fill
1225 out the bottom of pages (with blank lines or a form feed). No page
1226 structure is produced, but form feeds set in the input files are retained.
1227 The predefined pagination is not changed. @samp{-t} or @samp{-T} may be
1228 useful together with other options; e.g.: @samp{-t -e4}, expand TAB characters
1229 in the input file to 4 spaces but don't make any other changes. Use of
1230 @samp{-t} overrides @samp{-h}.
1233 @itemx --omit-pagination
1235 @opindex --omit-pagination
1236 Do not print header [and footer]. In addition eliminate all form feeds
1237 set in the input files.
1240 @itemx --show-nonprinting
1242 @opindex --show-nonprinting
1243 Print unprintable characters in octal backslash notation.
1245 @item -w @var{page_width}
1246 @itemx --width=@var{page_width}
1249 Set page width to @var{page_width} characters for multiple text-column
1250 output only (default for @var{page_width} is 72). @samp{-s[CHAR]} turns
1251 off the default page width and any line truncation and column alignment.
1252 Lines of full length are merged, regardless of the column options
1253 set. No @var{page_width} setting is possible with single column output.
A @sc{posix}-compliant formulation.
1256 @item -W @var{page_width}
1257 @itemx --page_width=@var{page_width}
1259 @opindex --page_width
1260 Set the page width to @var{page_width} characters. That's valid with and
1261 without a column option. Text lines are truncated, unless @samp{-J}
1262 is used. Together with one of the three column options
1263 (@samp{-@var{column}}, @samp{-a -@var{column}} or @samp{-m}) column
1264 alignment is always used. The separator options @samp{-S} or @samp{-s}
don't affect the @samp{-W} option. Default is 72 characters. Without
@samp{-W @var{page_width}} and without any of the column options, no line
truncation is used (defined to keep backward compatibility and to meet
the most frequent tasks). That's equivalent to @samp{-W 72 -J}. With and
1269 without @samp{-W @var{page_width}} the header line is always truncated
1270 to avoid line overflow.
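
The difference between printing columns down (the default) and across
(@samp{-a}) can be seen in a small sketch. It assumes a @sc{posix} shell
with @code{seq} and @code{awk} available; the variable names are
illustrative only, and @samp{-t} is used so that no header obscures the
column layout.

```shell
# Six input lines arranged in three columns.
down=$(seq 1 6 | pr -t -3)         # columns are filled downwards
across=$(seq 1 6 | pr -t -a -3)    # columns are filled across

# Collapse the padding between columns to single spaces for display.
down_rows=$(printf '%s\n' "$down" | awk '{ $1 = $1; print }')
across_rows=$(printf '%s\n' "$across" | awk '{ $1 = $1; print }')

printf 'down:\n%s\nacross:\n%s\n' "$down_rows" "$across_rows"
```

With columns printed down, consecutive input lines appear beneath each
other; with @samp{-a} they appear side by side.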
1275 @node fold invocation
1276 @section @code{fold}: Wrap input lines to fit in specified width
1279 @cindex wrapping long input lines
1280 @cindex folding long input lines
1282 @code{fold} writes each @var{file} (@samp{-} means standard input), or
1283 standard input if none are given, to standard output, breaking long
1287 fold [@var{option}]@dots{} [@var{file}]@dots{}
1290 By default, @code{fold} breaks lines wider than 80 columns. The output
1291 is split into as many lines as necessary.
1293 @cindex screen columns
1294 @code{fold} counts screen columns by default; thus, a tab may count more
1295 than one column, backspace decreases the column count, and carriage
1296 return sets the column to zero.
1298 The program accepts the following options. Also see @ref{Common options}.
1306 Count bytes rather than columns, so that tabs, backspaces, and carriage
1307 returns are each counted as taking up one column, just like other
1314 Break at word boundaries: the line is broken after the last blank before
1315 the maximum line length. If the line contains no such blanks, the line
1316 is broken at the maximum line length as usual.
1318 @item -w @var{width}
1319 @itemx --width=@var{width}
1322 Use a maximum line length of @var{width} columns instead of 80.
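
As an illustration of the options above, the following sketch folds one
sample sentence at 10 columns, first strictly and then at blanks with
@samp{-s}. The sample text and variable names are assumptions made for
the example.

```shell
text='The quick brown fox jumps over the lazy dog'

# Break strictly at column 10, even in the middle of a word.
hard=$(printf '%s\n' "$text" | fold -w 10)

# Break after the last blank that still fits within 10 columns.
soft=$(printf '%s\n' "$text" | fold -s -w 10)

printf '%s\n---\n%s\n' "$hard" "$soft"
```

In both cases @code{fold} only inserts newlines, so deleting them again
restores the original line.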
1327 @node Output of parts of files
1328 @chapter Output of parts of files
1330 @cindex output of parts of files
1331 @cindex parts of files, output of
1333 These commands output pieces of the input.
1336 * head invocation:: Output the first part of files.
1337 * tail invocation:: Output the last part of files.
1338 * split invocation:: Split a file into fixed-size pieces.
1339 * csplit invocation:: Split a file into context-determined pieces.
1342 @node head invocation
1343 @section @code{head}: Output the first part of files
1346 @cindex initial part of files, outputting
1347 @cindex first part of files, outputting
1349 @code{head} prints the first part (10 lines by default) of each
1350 @var{file}; it reads from standard input if no files are given or
1351 when given a @var{file} of @samp{-}. Synopses:
1354 head [@var{option}]@dots{} [@var{file}]@dots{}
1355 head -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1358 If more than one @var{file} is specified, @code{head} prints a
1359 one-line header consisting of
1361 ==> @var{file name} <==
1364 before the output for each @var{file}.
1366 @code{head} accepts two option formats: the new one, in which numbers
1367 are arguments to the options (@samp{-q -n 1}), and the old one, in which
1368 the number precedes any option letters (@samp{-1q}).
1370 The program accepts the following options. Also see @ref{Common options}.
1374 @item -@var{count}@var{options}
1375 @opindex -@var{count}
1376 This option is only recognized if it is specified first. @var{count} is
1377 a decimal number optionally followed by a size letter (@samp{b},
1378 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1379 or other option letters (@samp{cqv}).
1381 @item -c @var{bytes}
1382 @itemx --bytes=@var{bytes}
Print the first @var{bytes} bytes, instead of initial lines. Appending
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
by 1048576.
1390 @itemx --lines=@var{n}
1393 Output the first @var{n} lines.
1401 Never print file name headers.
1407 Always print file name headers.
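
A short sketch of the new-style options follows. The temporary
directory and the file names @file{a} and @file{b} are assumptions made
for the example; it relies on @code{mktemp} and @code{seq} being
available.

```shell
dir=$(mktemp -d)
seq 1 20 > "$dir/a"
seq 101 120 > "$dir/b"

first3=$(head -n 3 "$dir/a")             # first three lines of one file
first4bytes=$(head -c 4 "$dir/a")        # first four bytes
both=$(head -n 1 "$dir/a" "$dir/b")      # a header precedes each file
quiet=$(head -q -n 1 "$dir/a" "$dir/b")  # -q suppresses the headers

printf '%s\n' "$both"
rm -r "$dir"
```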
1412 @node tail invocation
1413 @section @code{tail}: Output the last part of files
1416 @cindex last part of files, outputting
1418 @code{tail} prints the last part (10 lines by default) of each
1419 @var{file}; it reads from standard input if no files are given or
1420 when given a @var{file} of @samp{-}. Synopses:
1423 tail [@var{option}]@dots{} [@var{file}]@dots{}
1424 tail -@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1425 tail +@var{number} [@var{option}]@dots{} [@var{file}]@dots{}
1428 If more than one @var{file} is specified, @code{tail} prints a
1429 one-line header consisting of
1431 ==> @var{file name} <==
1434 before the output for each @var{file}.
1436 @cindex BSD @code{tail}
1437 GNU @code{tail} can output any amount of data (some other versions of
1438 @code{tail} cannot). It also has no @samp{-r} option (print in
1439 reverse), since reversing a file is really a different job from printing
1440 the end of a file; BSD @code{tail} (which is the one with @code{-r}) can
1441 only reverse files that are at most as large as its buffer, which is
1442 typically 32k. A more reliable and versatile way to reverse files is
1443 the GNU @code{tac} command.
1445 @code{tail} accepts two option formats: the new one, in which numbers
1446 are arguments to the options (@samp{-n 1}), and the old one, in which
1447 the number precedes any option letters (@samp{-1} or @samp{+1}).
1449 If any option-argument is a number @var{n} starting with a @samp{+},
1450 @code{tail} begins printing with the @var{n}th item from the start of
1451 each file, instead of from the end.
1453 The program accepts the following options. Also see @ref{Common options}.
1459 @opindex -@var{count}
1460 @opindex +@var{count}
1461 This option is only recognized if it is specified first. @var{count} is
1462 a decimal number optionally followed by a size letter (@samp{b},
1463 @samp{k}, @samp{m}) as in @code{-c}, or @samp{l} to mean count by lines,
1464 or other option letters (@samp{cfqv}).
1466 @item -c @var{bytes}
1467 @itemx --bytes=@var{bytes}
Output the last @var{bytes} bytes, instead of final lines. Appending
@samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and @samp{m}
by 1048576.
1475 @itemx --follow[=@var{how}]
1478 @cindex growing files
1479 @vindex name @r{follow option}
1480 @vindex descriptor @r{follow option}
1481 Loop forever trying to read more characters at the end of the file,
1482 presumably because the file is growing. This option is ignored when
1483 reading from a pipe.
1484 If more than one file is given, @code{tail} prints a header whenever it
1485 gets output from a different file, to indicate which file that output is
1488 There are two ways to specify how you'd like to track files with this option,
1489 but that difference is noticeable only when a followed file is removed or
1491 If you'd like to continue to track the end of a growing file even after
1492 it has been unlinked, use @samp{--follow=descriptor}. This is the default
1493 behavior, but it is not useful if you're tracking a log file that may be
1494 rotated (removed or renamed, then reopened). In that case, use
1495 @samp{--follow=name} to track the named file by reopening it periodically
1496 to see if it has been removed and recreated by some other program.
1498 No matter which method you use, if the tracked file is determined to have
1499 shrunk, @code{tail} prints a message saying the file has been truncated
1500 and resumes tracking the end of the file from the newly-determined endpoint.
When a file is removed, @code{tail}'s behavior depends on whether it is
following the name or the descriptor. When following by name, @code{tail}
can detect that a file has been removed and gives a message to that effect,
and if @samp{--retry} has been specified it will continue checking
periodically to see if the file reappears.
When following a descriptor, @code{tail} does not detect that the file has
been unlinked or renamed and issues no message; even though the file
1509 may no longer be accessible via its original name, it may still be
1512 The option values @samp{descriptor} and @samp{name} may be specified only
1513 with the long form of the option, not with @samp{-f}.
1517 This option is meaningful only when following by name.
Without this option, when @code{tail} encounters a file that doesn't
exist or is otherwise inaccessible, it reports that fact and
never checks it again.
1522 @itemx --sleep-interval=@var{n}
1523 @opindex --sleep-interval
1524 Change the number of seconds to wait between iterations (the default is 1).
1525 During one iteration, every specified file is checked to see if it has
1528 @itemx --pid=@var{pid}
When following by name or by descriptor, you may specify the process ID,
@var{pid}, of the sole writer of all @var{file} arguments. Then, shortly
after that process terminates, @code{tail} will also terminate. This will
work properly only if the writer and the tailing process are running on
the same machine. For example, to save the output of a build in a file
and to watch the file grow, invoke @code{make} and @code{tail} as shown
below; the @code{tail} process will then stop when your build completes.
1537 Without this option, you would have had to kill the @code{tail -f}
1540 $ make >& makerr & tail --pid=$! -f makerr
1542 If you specify a @var{pid} that is not in use or that does not correspond
1543 to the process that is writing to the tailed files, then @code{tail}
1544 may terminate long before any @var{file}s stop growing or it may not
1545 terminate until long after the real writer has terminated.
1547 @itemx --max-consecutive-size-changes=@var{n}
1548 @opindex --max-consecutive-size-changes
1549 This option is meaningful only when following by name.
Use it to control how long @code{tail} follows the descriptor of a file
that continues growing at a rapid pace even after it is deleted or renamed.
After detecting @var{n} consecutive size changes for a file, @code{tail}
calls @code{open}/@code{fstat} on the file to determine if that file name
is still associated with the same device/inode-number pair as before.
1555 See the output of @code{tail --help} for the default value.
1557 @itemx --max-unchanged-stats=@var{n}
1558 @opindex --max-unchanged-stats
When tailing a file by name, if there have been this many consecutive
iterations for which the size has remained the same, then @code{tail}
calls @code{open}/@code{fstat} on the file to determine if that file name
is still associated with the same device/inode-number pair as before.
When following a log file that is rotated, this is approximately the
number of seconds between when @code{tail} prints the last pre-rotation
lines and when it prints the lines that have accumulated in the new log file.
1566 See the output of @code{tail --help} for the default value.
1567 This option is meaningful only when following by name.
1570 @itemx --lines=@var{n}
1573 Output the last @var{n} lines.
1581 Never print file name headers.
1587 Always print file name headers.
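
The following sketch contrasts counting from the end of the input with
counting from the start (a number beginning with @samp{+}). The input is
generated with @code{seq} and @code{printf}; the variable names are
illustrative assumptions.

```shell
last3=$(seq 1 10 | tail -n 3)                 # the final three lines
from8=$(seq 1 10 | tail -n +8)                # begin with the 8th line
last4bytes=$(printf 'abcdefgh' | tail -c 4)   # the final four bytes

printf '%s\n' "$last3" "$from8" "$last4bytes"
```

Here both line-oriented invocations select the same three lines,
because the input has exactly ten lines.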
1592 @node split invocation
1593 @section @code{split}: Split a file into fixed-size pieces
1596 @cindex splitting a file into pieces
1597 @cindex pieces, splitting a file into
1599 @code{split} creates output files containing consecutive sections of
1600 @var{input} (standard input if none is given or @var{input} is
1601 @samp{-}). Synopsis:
split [@var{option}]@dots{} [@var{input} [@var{prefix}]]
By default, @code{split} puts 1000 lines of @var{input} (or whatever is
left over for the last section) into each output file.
1610 @cindex output file name prefix
1611 The output files' names consist of @var{prefix} (@samp{x} by default)
1612 followed by a group of letters @samp{aa}, @samp{ab}, and so on, such
1613 that concatenating the output files in sorted order by file name produces
1614 the original input file. (If more than 676 output files are required,
1615 @code{split} uses @samp{zaa}, @samp{zab}, etc.)
1617 The program accepts the following options. Also see @ref{Common options}.
1622 @itemx -l @var{lines}
1623 @itemx --lines=@var{lines}
1626 Put @var{lines} lines of @var{input} into each output file.
1628 @item -b @var{bytes}
1629 @itemx --bytes=@var{bytes}
Put @var{bytes} bytes of @var{input} into each output file.
1633 Appending @samp{b} multiplies @var{bytes} by 512, @samp{k} by 1024, and
1634 @samp{m} by 1048576.
1636 @item -C @var{bytes}
1637 @itemx --line-bytes=@var{bytes}
1639 @opindex --line-bytes
Put into each output file as many complete lines of @var{input} as
possible without exceeding @var{bytes} bytes. For lines longer than
@var{bytes} bytes, put @var{bytes} bytes into each output file until
fewer than @var{bytes} bytes of the line are left, then continue
1644 normally. @var{bytes} has the same format as for the @samp{--bytes}
1649 Write a diagnostic to standard error just before each output file is opened.
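
As a sketch of the naming scheme and of the round trip back to the
original, the commands below split a 2500-line file into pieces of 1000
lines each. The directory, the input file and the prefix @file{part.}
are assumptions chosen for the example.

```shell
dir=$(mktemp -d)
seq 1 2500 > "$dir/input"

# 1000 lines per piece, written as part.aa, part.ab, part.ac.
( cd "$dir" && split -l 1000 input part. )

pieces=$(cd "$dir" && echo part.*)
last_size=$(wc -l < "$dir/part.ac" | tr -d ' ')

# Concatenating the pieces in sorted name order restores the input.
if cat "$dir"/part.* | cmp -s - "$dir/input"; then
  restored=yes
else
  restored=no
fi

echo "$pieces: last piece has $last_size lines, restored=$restored"
rm -r "$dir"
```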
1654 @node csplit invocation
1655 @section @code{csplit}: Split a file into context-determined pieces
1658 @cindex context splitting
1659 @cindex splitting a file into pieces by context
1661 @code{csplit} creates zero or more output files containing sections of
1662 @var{input} (standard input if @var{input} is @samp{-}). Synopsis:
1665 csplit [@var{option}]@dots{} @var{input} @var{pattern}@dots{}
1668 The contents of the output files are determined by the @var{pattern}
1669 arguments, as detailed below. An error occurs if a @var{pattern}
1670 argument refers to a nonexistent line of the input file (e.g., if no
1671 remaining line matches a given regular expression). After every
1672 @var{pattern} has been matched, any remaining input is copied into one
1675 By default, @code{csplit} prints the number of bytes written to each
1676 output file after it has been created.
1678 The types of pattern arguments are:
Create an output file containing the input up to but not including line
@var{n} (a positive integer). If followed by a repeat count, also
create an output file containing the next @var{n} lines of the input
file once for each repeat.
1688 @item /@var{regexp}/[@var{offset}]
1689 Create an output file containing the current line up to (but not
1690 including) the next line of the input file that contains a match for
1691 @var{regexp}. The optional @var{offset} is a @samp{+} or @samp{-}
1692 followed by a positive integer. If it is given, the input up to the
1693 matching line plus or minus @var{offset} is put into the output file,
1694 and the line after that begins the next section of input.
1696 @item %@var{regexp}%[@var{offset}]
1697 Like the previous type, except that it does not create an output
1698 file, so that section of the input file is effectively ignored.
1700 @item @{@var{repeat-count}@}
1701 Repeat the previous pattern @var{repeat-count} additional
1702 times. @var{repeat-count} can either be a positive integer or an
1703 asterisk, meaning repeat as many times as necessary until the input is
1708 The output files' names consist of a prefix (@samp{xx} by default)
followed by a suffix. By default, the suffix is an ascending sequence
of two-digit decimal numbers from @samp{00} up to @samp{99}. In any
1711 case, concatenating the output files in sorted order by filename
1712 produces the original input file.
1714 By default, if @code{csplit} encounters an error or receives a hangup,
1715 interrupt, quit, or terminate signal, it removes any output files
1716 that it has created so far before it exits.
1718 The program accepts the following options. Also see @ref{Common options}.
1722 @item -f @var{prefix}
1723 @itemx --prefix=@var{prefix}
1726 @cindex output file name prefix
1727 Use @var{prefix} as the output file name prefix.
1729 @item -b @var{suffix}
1730 @itemx --suffix=@var{suffix}
1733 @cindex output file name suffix
1734 Use @var{suffix} as the output file name suffix. When this option is
1735 specified, the suffix string must include exactly one
1736 @code{printf(3)}-style conversion specification, possibly including
format specification flags, a field width, a precision specification,
1738 or all of these kinds of modifiers. The format letter must convert a
1739 binary integer argument to readable form; thus, only @samp{d}, @samp{i},
1740 @samp{u}, @samp{o}, @samp{x}, and @samp{X} conversions are allowed. The
1741 entire @var{suffix} is given (with the current output file number) to
1742 @code{sprintf(3)} to form the file name suffixes for each of the
1743 individual output files in turn. If this option is used, the
1744 @samp{--digits} option is ignored.
1746 @item -n @var{digits}
1747 @itemx --digits=@var{digits}
1750 Use output file names containing numbers that are @var{digits} digits
1751 long instead of the default 2.
1756 @opindex --keep-files
1757 Do not remove output files when errors are encountered.
1760 @itemx --elide-empty-files
1762 @opindex --elide-empty-files
1763 Suppress the generation of zero-length output files. (In cases where
1764 the section delimiters of the input file are supposed to mark the first
1765 lines of each of the sections, the first output file will generally be a
1766 zero-length file unless you use this option.) The output file sequence
1767 numbers always run consecutively starting from 0, even when this option
1778 Do not print counts of output file sizes.
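
A regular-expression pattern combined with a repeat count can be
sketched as follows. The input text, the directory and the prefix
@file{sec.} are assumptions made for the example; each line beginning
with @samp{#} starts a new piece.

```shell
dir=$(mktemp -d)
printf '%s\n' intro '# one' a b '# two' c '# three' d e > "$dir/input"

# Split before every line that begins with "#". The repeat count "*"
# reapplies the pattern until the input is exhausted; -s suppresses the
# byte counts and -f sets the output file name prefix.
( cd "$dir" && csplit -s -f sec. input '/^#/' '{*}' )

count=$(cd "$dir" && ls sec.* | wc -l | tr -d ' ')
first=$(cat "$dir/sec.00")
second=$(cat "$dir/sec.01")

echo "$count pieces"
rm -r "$dir"
```

The first piece holds whatever precedes the first match, and the input
remaining after the last match goes into the final piece.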
1783 @node Summarizing files
1784 @chapter Summarizing files
1786 @cindex summarizing files
1788 These commands generate just a few numbers representing entire
1792 * wc invocation:: Print byte, word, and line counts.
1793 * sum invocation:: Print checksum and block counts.
1794 * cksum invocation:: Print CRC checksum and byte counts.
1795 * md5sum invocation:: Print or check message-digests.
1800 @section @code{wc}: Print byte, word, and line counts
1807 @code{wc} counts the number of bytes, whitespace-separated words, and
1808 newlines in each given @var{file}, or standard input if none are given
1809 or for a @var{file} of @samp{-}. Synopsis:
1812 wc [@var{option}]@dots{} [@var{file}]@dots{}
1815 @cindex total counts
1816 @code{wc} prints one line of counts for each file, and if the file was
1817 given as an argument, it prints the file name following the counts. If
1818 more than one @var{file} is given, @code{wc} prints a final line
1819 containing the cumulative counts, with the file name @file{total}. The
1820 counts are printed in this order: newlines, words, bytes.
1821 By default, each count is output right-justified in a 7-byte field with
1822 one space between fields so that the numbers and file names line up nicely
1823 in columns. However, POSIX requires that there be exactly one space
1824 separating columns. You can make @code{wc} use the POSIX-mandated
1825 output format by setting the @env{POSIXLY_CORRECT} environment variable.
1827 By default, @code{wc} prints all three counts. Options can specify
1828 that only certain counts be printed. Options do not undo others
1829 previously given, so
1836 prints both the byte counts and the word counts.
1838 With the @code{--max-line-length} option, @code{wc} prints the length
1839 of the longest line per file, and if there is more than one file it
1840 prints the maximum (not the sum) of those lengths.
1842 The program accepts the following options. Also see @ref{Common options}.
1852 Print only the byte counts.
1858 Print only the word counts.
1864 Print only the newline counts.
1867 @itemx --max-line-length
1869 @opindex --max-line-length
1870 Print only the maximum line lengths.
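
The order of the counts and the way options combine can be seen in this
sketch, which feeds two short lines to @code{wc}. The sample input and
variable names are assumptions; the unquoted @code{echo} merely
collapses the padding between fields.

```shell
counts=$(printf 'one two\nthree\n' | wc)             # newlines, words, bytes
bytes_words=$(printf 'one two\nthree\n' | wc -c -w)  # both counts are kept
longest=$(printf 'one two\nthree\n' | wc -L)         # longest line length

echo $counts
echo $bytes_words
echo $longest
```

Note that the counts appear in the fixed order described above
regardless of the order in which the options are given.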
1875 @node sum invocation
1876 @section @code{sum}: Print checksum and block counts
1879 @cindex 16-bit checksum
1880 @cindex checksum, 16-bit
1882 @code{sum} computes a 16-bit checksum for each given @var{file}, or
1883 standard input if none are given or for a @var{file} of @samp{-}. Synopsis:
1886 sum [@var{option}]@dots{} [@var{file}]@dots{}
1889 @code{sum} prints the checksum for each @var{file} followed by the
1890 number of blocks in the file (rounded up). If more than one @var{file}
is given, file names are also printed (by default). (With the
@samp{--sysv} option, the corresponding file name is printed when there is
at least one file argument.)
1895 By default, GNU @code{sum} computes checksums using an algorithm
1896 compatible with BSD @code{sum} and prints file sizes in units of
1899 The program accepts the following options. Also see @ref{Common options}.
1905 @cindex BSD @code{sum}
1906 Use the default (BSD compatible) algorithm. This option is included for
1907 compatibility with the System V @code{sum}. Unless @samp{-s} was also
1908 given, it has no effect.
1914 @cindex System V @code{sum}
1915 Compute checksums using an algorithm compatible with System V
1916 @code{sum}'s default, and print file sizes in units of 512-byte blocks.
1920 @code{sum} is provided for compatibility; the @code{cksum} program (see
1921 next section) is preferable in new applications.
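
The two algorithms can be compared on the same input with a small
sketch. The six-byte sample input is an assumption; being shorter than
one block, it is reported as a single block under either block size.

```shell
bsd=$(printf 'hello\n' | sum)       # default, BSD-compatible algorithm
sysv=$(printf 'hello\n' | sum -s)   # System V algorithm, 512-byte blocks

printf 'BSD:  %s\nSysV: %s\n' "$bsd" "$sysv"
```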
1924 @node cksum invocation
1925 @section @code{cksum}: Print CRC checksum and byte counts
1928 @cindex cyclic redundancy check
1929 @cindex CRC checksum
1931 @code{cksum} computes a cyclic redundancy check (CRC) checksum for each
1932 given @var{file}, or standard input if none are given or for a
1933 @var{file} of @samp{-}. Synopsis:
1936 cksum [@var{option}]@dots{} [@var{file}]@dots{}
1939 @code{cksum} prints the CRC checksum for each file along with the number
1940 of bytes in the file, and the filename unless no arguments were given.
1942 @code{cksum} is typically used to ensure that files
1943 transferred by unreliable means (e.g., netnews) have not been corrupted,
1944 by comparing the @code{cksum} output for the received files with the
1945 @code{cksum} output for the original files (typically given in the
1948 The CRC algorithm is specified by the @sc{posix.2} standard. It is not
1949 compatible with the BSD or System V @code{sum} algorithms (see the
1950 previous section); it is more robust.
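
The comparison described above might be sketched like this. The file
names @file{original}, @file{received} and @file{damaged} are
assumptions for the example; reading each file from standard input keeps
the file name out of the output, so that only the checksum and the byte
count are compared.

```shell
dir=$(mktemp -d)
printf 'payload\n' > "$dir/original"
cp "$dir/original" "$dir/received"
printf 'pazload\n' > "$dir/damaged"   # same length, one byte changed

sent=$(cksum < "$dir/original")
got=$(cksum < "$dir/received")
bad=$(cksum < "$dir/damaged")

if [ "$sent" = "$got" ]; then verdict=intact; else verdict=corrupted; fi
if [ "$sent" = "$bad" ]; then verdict_bad=intact; else verdict_bad=corrupted; fi

echo "received: $verdict, damaged: $verdict_bad"
rm -r "$dir"
```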
1952 The only options are @samp{--help} and @samp{--version}. @xref{Common
1956 @node md5sum invocation
1957 @section @code{md5sum}: Print or check message-digests
1960 @cindex 128-bit checksum
1961 @cindex checksum, 128-bit
1962 @cindex fingerprint, 128-bit
1963 @cindex message-digest, 128-bit
1965 @code{md5sum} computes a 128-bit checksum (or @dfn{fingerprint} or
1966 @dfn{message-digest}) for each specified @var{file}.
1967 If a @var{file} is specified as @samp{-} or if no files are given
1968 @code{md5sum} computes the checksum for the standard input.
1969 @code{md5sum} can also determine whether a file and checksum are
1970 consistent. Synopses:
1973 md5sum [@var{option}]@dots{} [@var{file}]@dots{}
1974 md5sum [@var{option}]@dots{} --check [@var{file}]
For each @var{file}, @code{md5sum} outputs the MD5 checksum, a flag
1978 indicating a binary or text input file, and the filename.
1979 If @var{file} is omitted or specified as @samp{-}, standard input is read.
1981 The program accepts the following options. Also see @ref{Common options}.
1989 @cindex binary input files
1990 Treat all input files as binary. This option has no effect on Unix
1991 systems, since they don't distinguish between binary and text files.
1992 This option is useful on systems that have different internal and
1993 external character representations. On MS-DOS and MS-Windows, this is
1998 Read filenames and checksum information from the single @var{file}
1999 (or from stdin if no @var{file} was specified) and report whether
2000 each named file and the corresponding checksum data are consistent.
2001 The input to this mode of @code{md5sum} is usually the output of
a prior, checksum-generating run of @code{md5sum}.
2003 Each valid line of input consists of an MD5 checksum, a binary/text
2004 flag, and then a filename.
2005 Binary files are marked with @samp{*}, text with @samp{ }.
2006 For each such line, @code{md5sum} reads the named file and computes its
2007 MD5 checksum. Then, if the computed message digest does not match the
2008 one on the line with the filename, the file is noted as having
2009 failed the test. Otherwise, the file passes the test.
2010 By default, for each valid line, one line is written to standard
2011 output indicating whether the named file passed the test.
2012 After all checks have been performed, if there were any failures,
2013 a warning is issued to standard error.
2014 Use the @samp{--status} option to inhibit that output.
2015 If any listed file cannot be opened or read, if any valid line has
2016 an MD5 checksum inconsistent with the associated file, or if no valid
2017 line is found, @code{md5sum} exits with nonzero status. Otherwise,
2018 it exits successfully.
2022 @cindex verifying MD5 checksums
2023 This option is useful only when verifying checksums.
2024 When verifying checksums, don't generate the default one-line-per-file
2025 diagnostic and don't output the warning summarizing any failures.
2026 Failures to open or read a file still evoke individual diagnostics to
2028 If all listed files are readable and are consistent with the associated
2029 MD5 checksums, exit successfully. Otherwise exit with a status code
2030 indicating there was a failure.
2036 @cindex text input files
2037 Treat all input files as text files. This is the reverse of
2044 @cindex verifying MD5 checksums
2045 When verifying checksums, warn about improperly formatted MD5 checksum lines.
2046 This option is useful only if all but a few lines in the checked input
2052 @node Operating on sorted files
2053 @chapter Operating on sorted files
2055 @cindex operating on sorted files
2056 @cindex sorted files, operations on
2058 These commands work with (or produce) sorted files.
2061 * sort invocation:: Sort text files.
2062 * uniq invocation:: Uniquify files.
2063 * comm invocation:: Compare two sorted files line by line.
2064 * ptx invocation:: Produce a permuted index of file contents.
2065 * tsort invocation:: Topological sort.
2069 @node sort invocation
2070 @section @code{sort}: Sort text files
2073 @cindex sorting files
2075 @code{sort} sorts, merges, or compares all the lines from the given
2076 files, or standard input if none are given or for a @var{file} of
2077 @samp{-}. By default, @code{sort} writes the results to standard
2081 sort [@var{option}]@dots{} [@var{file}]@dots{}
2084 @code{sort} has three modes of operation: sort (the default), merge,
2085 and check for sortedness. The following options change the operation
2092 @cindex checking for sortedness
2093 Check whether the given files are already sorted: if they are not all
2094 sorted, print an error message and exit with a status of 1.
2095 Otherwise, exit successfully.
2099 @cindex merging sorted files
2100 Merge the given files by sorting them as a group. Each input file must
2101 always be individually sorted. It always works to sort instead of
2102 merge; merging is provided because it is faster, in the case where it
2108 A pair of lines is compared as follows: if any key fields have been
2109 specified, @code{sort} compares each pair of fields, in the order
2110 specified on the command line, according to the associated ordering
2111 options, until a difference is found or no fields are left.
2112 Unless otherwise specified, all comparisons use the character
2113 collating sequence specified by the @env{LC_COLLATE} locale.
2115 If any of the global options @samp{Mbdfinr} are given but no key fields
2116 are specified, @code{sort} compares the entire lines according to the
2119 Finally, as a last resort when all keys compare equal (or if no
2120 ordering options were specified at all), @code{sort} compares the entire
2121 lines. The last resort comparison
2122 honors the @samp{-r} global option. The @samp{-s} (stable) option
2123 disables this last-resort comparison so that lines in which all fields
2124 compare equal are left in their original relative order. If no fields
2125 or global options are specified, @samp{-s} has no effect.
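The effect of the last-resort comparison can be seen in a small sketch
such as the following. The four sample lines are made up for
illustration, and @samp{LC_ALL=C} is set only to make the collating
sequence predictable.

```shell
input=$(printf 'b 2\na 2\nb 1\na 1\n')

# Key on field 1 only; ties fall through to the whole-line comparison.
default_order=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1,1)

# Same key with -s: lines whose keys tie keep their input order.
stable_order=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf 'default:\n%s\nstable:\n%s\n' "$default_order" "$stable_order"
```

Without @samp{-s} the two @samp{a} lines come out as @samp{a 1} then
@samp{a 2}; with @samp{-s} they stay in input order, @samp{a 2} then
@samp{a 1}.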
2127 GNU @code{sort} (as specified for all GNU utilities) has no limits on
2128 input line length or restrictions on bytes allowed within lines. In
2129 addition, if the final byte of an input file is not a newline, GNU
2130 @code{sort} silently supplies one. A line's trailing newline is part of
2131 the line for comparison purposes; for example, with no options in an
2132 @sc{ascii} locale, a line starting with a tab sorts before an empty line
2133 because tab precedes newline in the @sc{ascii} collating sequence.
2135 Upon any error, @code{sort} exits with a status of @samp{2}.
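A script can tell these outcomes apart by inspecting the exit status.
The sketch below assumes a current GNU @code{sort}; the path
@file{/nonexistent/input} is a hypothetical name chosen so that opening
it fails.

```shell
# Already sorted input: the check succeeds (status 0).
sorted_status=0
printf 'a\nb\n' | LC_ALL=C sort -c 2> /dev/null || sorted_status=$?

# Out-of-order input: the check fails with status 1.
unsorted_status=0
printf 'b\na\n' | LC_ALL=C sort -c 2> /dev/null || unsorted_status=$?

# A real error (here, an unreadable input file) gives status 2.
error_status=0
LC_ALL=C sort /nonexistent/input > /dev/null 2>&1 || error_status=$?

printf '%s %s %s\n' "$sorted_status" "$unsorted_status" "$error_status"
```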
2138 If the environment variable @env{TMPDIR} is set, @code{sort} uses its
2139 value as the directory for temporary files instead of @file{/tmp}. The
2140 @samp{-T @var{tempdir}} option in turn overrides the environment
2144 The following options affect the ordering of output lines. They may be
2145 specified globally or as part of a specific key field. If no key
2146 fields are specified, global options apply to comparison of entire
2147 lines; otherwise the global options are inherited by key fields that do
2148 not specify any special options of their own. The @samp{-b}, @samp{-d},
2149 @samp{-f} and @samp{-i} options classify characters according to
2150 the @env{LC_CTYPE} locale.
2156 @cindex blanks, ignoring leading
2157 Ignore leading blanks when finding sort keys in each line.
2161 @cindex phone directory order
2162 @cindex telephone directory order
2163 Sort in @dfn{phone directory} order: ignore all characters except
2164 letters, digits and blanks when sorting.
2168 @cindex case folding
2169 Fold lowercase characters into the equivalent uppercase characters when
2170 sorting so that, for example, @samp{b} and @samp{B} sort as equal.
2174 @cindex general numeric sort
2175 Sort numerically, using the standard C function @code{strtod} to convert
2176 a prefix of each line to a double-precision floating point number.
2177 This allows floating point numbers to be specified in scientific notation,
2178 like @code{1.0e-34} and @code{10e100}.
2179 Do not report overflow, underflow, or conversion errors.
2180 Use the following collating sequence:
2184 Lines that do not start with numbers (all considered to be equal).
2186 NaNs (``Not a Number'' values, in IEEE floating point arithmetic)
2187 in a consistent but machine-dependent order.
2191 Finite numbers in ascending numeric order (with @math{-0} and @math{+0} equal).
2196 Use this option only if there is no alternative; it is much slower than
2197 @samp{-n} and it can lose information when converting to floating point.
2201 @cindex unprintable characters, ignoring
2202 Ignore unprintable characters.
2206 @cindex months, sorting by
2208 An initial string, consisting of any amount of whitespace, followed
2209 by a month name abbreviation, is folded to UPPER case and
2210 compared in the order @samp{JAN} < @samp{FEB} < @dots{} < @samp{DEC}.
2211 Invalid names compare low to valid names. The @env{LC_TIME} locale
2212 determines the month spellings.
2216 @cindex numeric sort
2218 Sort numerically: the number begins each line; specifically, it consists
2219 of optional whitespace, an optional @samp{-} sign, and zero or more
2220 digits possibly separated by thousands separators, optionally followed
2221 by a radix character and zero or more digits. The @env{LC_NUMERIC}
2222 locale specifies the radix character and thousands separator.
2224 @code{sort -n} uses what might be considered an unconventional method
2225 to compare strings representing floating point numbers. Rather than
2226 first converting each string to the C @code{double} type and then
comparing those values, @code{sort} aligns the radix characters in the two
2228 strings and compares the strings a character at a time. One benefit
2229 of using this approach is its speed. In practice this is much more
2230 efficient than performing the two corresponding string-to-double (or even
2231 string-to-integer) conversions and then comparing doubles. In addition,
2232 there is no corresponding loss of precision. Converting each string to
2233 @code{double} before comparison would limit precision to about 16 digits
2236 Neither a leading @samp{+} nor exponential notation is recognized.
2237 To compare such strings numerically, use the @samp{-g} option.
2241 @cindex reverse sorting
2242 Reverse the result of comparison, so that lines with greater key values
2243 appear earlier in the output instead of later.
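The difference between @samp{-n} and @samp{-g} described above can be
sketched with three made-up input lines. This assumes a current GNU
@code{sort} in the C locale; other implementations may treat
unrecognized numbers differently.

```shell
input=$(printf '1e3\n20\n+5\n')

# -n: "1e3" is read as 1 and "+5" is not recognized as a number,
# so 20 is the largest value.
numeric=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# -g: the strtod-style conversion understands both forms (5, 20, 1000).
general=$(printf '%s\n' "$input" | LC_ALL=C sort -g)

printf 'sort -n:\n%s\nsort -g:\n%s\n' "$numeric" "$general"
```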
2251 @item -o @var{output-file}
2253 @cindex overwriting of input, allowed
2254 Write output to @var{output-file} instead of standard output.
2255 If @var{output-file} is one of the input files, @code{sort} copies
2256 it to a temporary file before sorting and writing the output to
2259 @item -t @var{separator}
2261 @cindex field separator character
2262 Use character @var{separator} as the field separator when finding the
2263 sort keys in each line. By default, fields are separated by the empty
2264 string between a non-whitespace character and a whitespace character.
2265 That is, given the input line @w{@samp{ foo bar}}, @code{sort} breaks it
2266 into fields @w{@samp{ foo}} and @w{@samp{ bar}}. The field separator is
2267 not considered to be part of either the field preceding or the field
2272 @cindex uniquifying output
2273 For the default case or the @samp{-m} option, only output the first
2274 of a sequence of lines that compare equal. For the @samp{-c} option,
2275 check that no pair of consecutive lines compares equal.
2277 @item -k @var{pos1}[,@var{pos2}]
2280 The recommended, @sc{posix}, option for specifying a sort field. The field
2281 consists of the part of the line between @var{pos1} and @var{pos2} (or the
2282 end of the line, if @var{pos2} is omitted), @emph{inclusive}.
2283 Fields and character positions are numbered starting with 1.
So to sort on the second field, you'd use @samp{-k 2,2}.
2285 See below for more examples.
2289 @cindex sort zero-terminated lines
2290 Treat the input as a set of lines, each terminated by a zero byte (@sc{ascii}
2291 @sc{nul} (Null) character) instead of an @sc{ascii} @sc{lf} (Line Feed).
2292 This option can be useful in conjunction with @samp{perl -0} or
2293 @samp{find -print0} and @samp{xargs -0} which do the same in order to
2294 reliably handle arbitrary pathnames (even those which contain Line Feed
2297 @item +@var{pos1}[-@var{pos2}]
2298 The obsolete, traditional option for specifying a sort field. The field
2299 consists of the line between @var{pos1} and up to but @emph{not including}
2300 @var{pos2} (or the end of the line if @var{pos2} is omitted). Fields
2301 and character positions are numbered starting with 0. See below.
2305 In addition, when GNU @code{sort} is invoked with exactly one argument,
2306 options @samp{--help} and @samp{--version} are recognized. @xref{Common
2309 Historical (BSD and System V) implementations of @code{sort} have
2310 differed in their interpretation of some options, particularly
2311 @samp{-b}, @samp{-f}, and @samp{-n}. GNU sort follows the @sc{posix}
2312 behavior, which is usually (but not always!) like the System V behavior.
2313 According to @sc{posix}, @samp{-n} no longer implies @samp{-b}. For
2314 consistency, @samp{-M} has been changed in the same way. This may
2315 affect the meaning of character positions in field specifications in
2316 obscure cases. The only fix is to add an explicit @samp{-b}.
2318 A position in a sort field specified with the @samp{-k} or @samp{+}
2319 option has the form @samp{@var{f}.@var{c}}, where @var{f} is the number
2320 of the field to use and @var{c} is the number of the first character
2321 from the beginning of the field (for @samp{+@var{pos}}) or from the end
2322 of the previous field (for @samp{-@var{pos}}). If the @samp{.@var{c}}
2323 is omitted, it is taken to be the first character in the field. If the
2324 @samp{-b} option was specified, the @samp{.@var{c}} part of a field
2325 specification is counted from the first nonblank character of the field
2326 (for @samp{+@var{pos}}) or from the first nonblank character following
2327 the previous field (for @samp{-@var{pos}}).
2329 A sort key option may also have any of the option letters @samp{Mbdfinr}
2330 appended to it, in which case the global ordering options are not used
2331 for that particular field. The @samp{-b} option may be independently
2332 attached to either or both of the @samp{+@var{pos}} and
2333 @samp{-@var{pos}} parts of a field specification, and if it is inherited
2334 from the global options it will be attached to both.
2335 Keys may span multiple fields.
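As a small sketch of the position and modifier syntax described above,
the following commands sort three made-up colon-separated records in
different ways. The data and the variable names are illustrative only.

```shell
input=$(printf 'x:10:b\ny:9:a\nz:10:a\n')

# Field 2 compared numerically, ties broken on field 3 as text.
by_number=$(printf '%s\n' "$input" | LC_ALL=C sort -t : -k 2,2n -k 3,3)

# Without the n modifier, field 2 is compared as text, so "10" < "9".
by_text=$(printf '%s\n' "$input" | LC_ALL=C sort -t : -k 2,2 -k 3,3)

# f.c form: the first key is the single character at position 1 of
# field 3; ties are broken on field 1 in reverse order.
by_char=$(printf '%s\n' "$input" | LC_ALL=C sort -t : -k 3.1,3.1 -k 1,1r)

printf '%s\n--\n%s\n--\n%s\n' "$by_number" "$by_text" "$by_char"
```

Attaching @samp{n} or @samp{r} to a key, as here, affects only that key
and leaves the other keys with the default ordering.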
2337 Here are some examples to illustrate various combinations of options.
2338 In them, the @sc{posix} @samp{-k} option is used to specify sort keys rather
2339 than the obsolete @samp{+@var{pos1}-@var{pos2}} syntax.
2344 Sort in descending (reverse) numeric order.
2350 Sort alphabetically, omitting the first and second fields.
2351 This uses a single key composed of the characters beginning
2352 at the start of field three and extending to the end of each line.
2359 Sort numerically on the second field and resolve ties by sorting
2360 alphabetically on the third and fourth characters of field five.
2361 Use @samp{:} as the field delimiter.
2364 sort -t : -k 2,2n -k 5.3,5.4
2367 Note that if you had written @samp{-k 2} instead of @samp{-k 2,2}
2368 @samp{sort} would have used all characters beginning in the second field
2369 and extending to the end of the line as the primary @emph{numeric}
2370 key. For the large majority of applications, treating keys spanning
2371 more than one field as numeric will not do what you expect.
2373 Also note that the @samp{n} modifier was applied to the field-end
2374 specifier for the first key. It would have been equivalent to
2375 specify @samp{-k 2n,2} or @samp{-k 2n,2n}. All modifiers except
2376 @samp{b} apply to the associated @emph{field}, regardless of whether
2377 the modifier character is attached to the field-start and/or the
2378 field-end part of the key specifier.
2381 Sort the password file on the fifth field and ignore any
2382 leading white space. Sort lines with equal values in field five
2383 on the numeric user ID in field three.
2386 sort -t : -k 5b,5 -k 3,3n /etc/passwd
2389 An alternative is to use the global numeric modifier @samp{-n}.
2392 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
Generate a tags file in case-insensitive sorted order.
2398 find src -type f -print0 | sort -t / -z -f | xargs -0 etags --append
The use of @samp{-print0}, @samp{-z}, and @samp{-0} in this case means
2402 that pathnames that contain Line Feed characters will not get broken up
2403 by the sort operation.
2405 Finally, to ignore both leading and trailing white space, you
2406 could have applied the @samp{b} modifier to the field-end specifier
2410 sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
or used the global @samp{-b} modifier instead of @samp{-n}
2414 and an explicit @samp{n} with the second key specifier.
2417 sort -t : -b -k 5,5 -k 3,3n /etc/passwd
2420 @c This example is a bit contrived and needs more explanation.
2422 @c Sort records separated by an arbitrary string by using a pipe to convert
2423 @c each record delimiter string to @samp{\0}, then using sort's -z option,
2424 @c and converting each @samp{\0} back to the original record delimiter.
2427 @c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 's/\0/\n/g'
2433 @node uniq invocation
2434 @section @code{uniq}: Uniquify files
2437 @cindex uniquify files
@code{uniq} writes the unique lines in the given @var{input}, or
2440 standard input if nothing is given or for an @var{input} name of
2444 uniq [@var{option}]@dots{} [@var{input} [@var{output}]]
2447 By default, @code{uniq} prints the unique lines in a sorted file, i.e.,
2448 discards all but one of identical successive lines. Optionally, it can
2449 instead show only lines that appear exactly once, or lines that appear
2452 The input must be sorted. If your input is not sorted, perhaps you want
2453 to use @code{sort -u}.
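The following sketch shows why sorted input matters. The four sample
lines are made up, and @samp{LC_ALL=C} only fixes the collating
sequence.

```shell
input=$(printf 'b\na\nb\nb\n')

# Unsorted input: only adjacent duplicates collapse, so "b" appears twice.
unsorted=$(printf '%s\n' "$input" | uniq)

# Sorting first brings all duplicates together.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq)

# sort -u does both steps at once.
one_step=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

printf 'uniq alone:\n%s\nsort then uniq:\n%s\n' "$unsorted" "$sorted"
```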
2455 If no @var{output} file is specified, @code{uniq} writes to standard
2458 The program accepts the following options. Also see @ref{Common options}.
2464 @itemx --skip-fields=@var{n}
2467 @opindex --skip-fields
2468 Skip @var{n} fields on each line before checking for uniqueness. Fields
2469 are sequences of non-space non-tab characters that are separated from
each other by at least one space or tab.
2474 @itemx --skip-chars=@var{n}
2477 @opindex --skip-chars
2478 Skip @var{n} characters before checking for uniqueness. If you use both
2479 the field and character skipping options, fields are skipped over first.
2485 Print the number of times each line occurred along with the line.
2488 @itemx --ignore-case
2490 @opindex --ignore-case
2491 Ignore differences in case when comparing lines.
2497 @cindex duplicate lines, outputting
2498 Print only duplicate lines.
2501 @itemx --all-repeated
2503 @opindex --all-repeated
2504 @cindex all duplicate lines, outputting
2505 Print all duplicate lines and only duplicate lines.
This option is useful mainly in conjunction with other options, e.g.,
2507 to ignore case or to compare only selected fields.
2508 This is a GNU extension.
2509 @c FIXME: give an example showing *how* it's useful
2515 @cindex unique lines, outputting
2516 Print only unique lines.
2519 @itemx --check-chars=@var{n}
2521 @opindex --check-chars
2522 Compare @var{n} characters on each line (after skipping any specified
2523 fields and characters). By default the entire rest of the lines are
2529 @node comm invocation
2530 @section @code{comm}: Compare two sorted files line by line
2533 @cindex line-by-line comparison
2534 @cindex comparing sorted files
2536 @code{comm} writes to standard output lines that are common, and lines
2537 that are unique, to two input files; a file name of @samp{-} means
2538 standard input. Synopsis:
2541 comm [@var{option}]@dots{} @var{file1} @var{file2}
2545 Before @code{comm} can be used, the input files must be sorted using the
2546 collating sequence specified by the @env{LC_COLLATE} locale, with
2547 trailing newlines significant. If an input file ends in a non-newline
2548 character, a newline is silently appended. The @code{sort} command with
2549 no options always outputs a file that is suitable input to @code{comm}.
2551 @cindex differing lines
2552 @cindex common lines
With no options, @code{comm} produces three-column output. Column one
2554 contains lines unique to @var{file1}, column two contains lines unique
2555 to @var{file2}, and column three contains lines common to both files.
2556 Columns are separated by a single TAB character.
2557 @c FIXME: when there's an option to supply an alternative separator
2558 @c string, append `by default' to the above sentence.
2563 The options @samp{-1}, @samp{-2}, and @samp{-3} suppress printing of
2564 the corresponding columns. Also see @ref{Common options}.
2566 Unlike some other comparison utilities, @code{comm} has an exit
2567 status that does not depend on the result of the comparison.
2568 Upon normal completion @code{comm} produces an exit code of zero.
2569 If there is an error it exits with nonzero status.
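The column layout and the column-suppressing options described above
can be sketched as follows. The two input files and their contents are
made up for illustration and are created in a scratch directory.

```shell
dir=$(mktemp -d) || exit 1
printf 'apple\nbanana\ncherry\n' > "$dir/first"
printf 'banana\ncherry\ndate\n' > "$dir/second"

# Three columns: only in first, only in second, in both.
all_columns=$(LC_ALL=C comm "$dir/first" "$dir/second")

# Suppress columns 1 and 2, leaving only the common lines.
common=$(LC_ALL=C comm -12 "$dir/first" "$dir/second")

# Suppress columns 2 and 3, leaving lines unique to the first file.
only_first=$(LC_ALL=C comm -23 "$dir/first" "$dir/second")

rm -rf "$dir"
printf '%s\n' "$all_columns"
```

In the three-column output, a line in column two is preceded by one TAB
and a line in column three by two TABs.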
2572 @node tsort invocation
2573 @section @code{tsort}: Topological sort
2576 @cindex topological sort
2578 @code{tsort} performs a topological sort on the given @var{file}, or
2579 standard input if no input file is given or for a @var{file} of
2583 tsort [@var{option}] [@var{file}]
2586 @code{tsort} reads its input as pairs of strings, separated by blanks,
2587 indicating a partial ordering. The output is a total ordering that
2588 corresponds to the given partial ordering.
2602 will produce the output
@code{tsort} detects cycles in the input and writes the first cycle
2614 encountered to standard error.
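As a small sketch of the pair format, the input below states that each
task must precede the next one. The task names are made up, and a simple
chain is used because it admits only one total ordering. The second
command feeds @code{tsort} a two-element cycle and captures whatever it
reports on standard error.

```shell
# Each pair "x y" states that x must come before y.
order=$(printf 'wake dress\ndress eat\neat leave\n' | tsort)

# A cycle: a before b and b before a. Keep only the diagnostic text.
cycle_report=$(printf 'a b\nb a\n' | tsort 2>&1 > /dev/null) || true

printf 'order:\n%s\ncycle diagnostic:\n%s\n' "$order" "$cycle_report"
```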
2616 Note that for a given partial ordering, generally there is no unique
2619 The only options are @samp{--help} and @samp{--version}. @xref{Common
2623 @node ptx invocation
2624 @section @code{ptx}: Produce permuted indexes
2628 @code{ptx} reads a text file and essentially produces a permuted index, with
each keyword in its context. The synopsis is one of the following:
2632 ptx [@var{option} @dots{}] [@var{file} @dots{}]
2633 ptx -G [@var{option} @dots{}] [@var{input} [@var{output}]]
The @samp{-G} (or its equivalent: @samp{--traditional}) option disables
all GNU extensions and reverts to traditional mode, thus introducing some
limitations and changing several of the program's default option values.
2639 When @samp{-G} is not specified, GNU extensions are always enabled. GNU
2640 extensions to @code{ptx} are documented wherever appropriate in this
document. For the full list, see @ref{Compatibility in ptx}.
Individual options are explained in the following sections.
2645 When GNU extensions are enabled, there may be zero, one or several
2646 @var{file} after the options. If there is no @var{file}, the program
reads the standard input. If there are one or more @var{file} arguments, they
name the input files, which are all read in turn, as if all the
2649 input files were concatenated. However, there is a full contextual
2650 break between each file and, when automatic referencing is requested,
2651 file names and line numbers refer to individual text input files. In
2652 all cases, the program produces the permuted index onto the standard
2655 When GNU extensions are @emph{not} enabled, that is, when the program
2656 operates in traditional mode, there may be zero, one or two parameters
besides the options. If there are no parameters, the program reads the
2658 standard input and produces the permuted index onto the standard output.
2659 If there is only one parameter, it names the text @var{input} to be read
2660 instead of the standard input. If two parameters are given, they give
2661 respectively the name of the @var{input} file to read and the name of
2662 the @var{output} file to produce. @emph{Be very careful} to note that,
2663 in this case, the contents of file given by the second parameter is
2664 destroyed. This behaviour is dictated only by System V @code{ptx}
2665 compatibility, because GNU Standards discourage output parameters not
2666 introduced by an option.
2668 Note that for @emph{any} file named as the value of an option or as an
2669 input text file, a single dash @kbd{-} may be used, in which case
2670 standard input is assumed. However, it would not make sense to use this
2671 convention more than once per program invocation.
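A minimal sketch of the GNU-mode calling convention follows. The file
name @file{notes.txt} and its one-line content are made up; the exact
layout of the index is not shown because it depends on the output
options described later.

```shell
dir=$(mktemp -d) || exit 1
printf 'the quick brown fox\n' > "$dir/notes.txt"

# GNU mode: the input is named as a file and the index goes to
# standard output.
from_file=$(ptx "$dir/notes.txt")

# A single dash stands for standard input.
from_stdin=$(ptx - < "$dir/notes.txt")

rm -rf "$dir"
printf '%s\n' "$from_file"
```

Because no references are requested here, the two invocations are
expected to produce the same index.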
2674 * General options in ptx:: Options which affect general program behaviour.
2675 * Charset selection in ptx:: Underlying character set considerations.
2676 * Input processing in ptx:: Input fields, contexts, and keyword selection.
2677 * Output formatting in ptx:: Types of output format, and sizing the fields.
2678 * Compatibility in ptx::
2682 @node General options in ptx
2683 @subsection General options
Print a short note about the copyright and copying conditions, then
exit without further processing.
2693 @itemx --traditional
As already explained, this option disables all GNU extensions to
@code{ptx} and switches to traditional mode.
Print a short help message on standard output, then exit without further
Print the program version on standard output, then exit without further
2708 @node Charset selection in ptx
2709 @subsection Charset selection
As it is set up now, the program assumes that the input file is coded
using 8-bit ISO 8859-1 code, also known as the Latin-1 character set,
@emph{unless} it is compiled for MS-DOS, in which case it uses the
2714 character set of the IBM-PC. (GNU @code{ptx} is not known to work on
2715 smaller MS-DOS machines anymore.) Compared to 7-bit @sc{ascii}, the set of
characters which are letters is then different, and this alters the
behaviour of regular expression matching. Thus, the default regular
2718 expression for a keyword allows foreign or diacriticized letters.
2719 Keyword sorting, however, is still crude; it obeys the underlying
2720 character set ordering quite blindly.
2725 @itemx --ignore-case
2726 Fold lower case letters to upper case for sorting.
2731 @node Input processing in ptx
2732 @subsection Word selection and input processing
2737 @item --break-file=@var{file}
2739 This option provides an alternative (to @samp{-W}) method of describing
2740 which characters make up words. It introduces the name of a
file which contains a list of characters which can@emph{not} be part of
a word; this file is called the @dfn{Break file}. Any character which
2743 is not part of the Break file is a word constituent. If both options
2744 @samp{-b} and @samp{-W} are specified, then @samp{-W} has precedence and
2745 @samp{-b} is ignored.
2747 When GNU extensions are enabled, the only way to avoid newline as a
2748 break character is to write all the break characters in the file with no
2749 newline at all, not even at the end of the file. When GNU extensions
2750 are disabled, spaces, tabs and newlines are always considered as break
2751 characters even if not included in the Break file.
2754 @itemx --ignore-file=@var{file}
2756 The file associated with this option contains a list of words which will
2757 never be taken as keywords in concordance output. It is called the
2758 @dfn{Ignore file}. The file contains exactly one word in each line; the
2759 end of line separation of words is not subject to the value of the
2762 There is a default Ignore file used by @code{ptx} when this option is
2763 not specified, usually found in @file{/usr/local/lib/eign} if this has
2764 not been changed at installation time. If you want to deactivate the
2765 default Ignore file, specify @code{/dev/null} instead.
2768 @itemx --only-file=@var{file}
2770 The file associated with this option contains a list of words which will
be retained in concordance output; any word not mentioned in this file
2772 is ignored. The file is called the @dfn{Only file}. The file contains
2773 exactly one word in each line; the end of line separation of words is
2774 not subject to the value of the @samp{-S} option.
There is no default for the Only file. When both an Only file and an
Ignore file are given, a word is eligible to be a keyword
only if it is listed in the Only file and not listed in the Ignore file.
On each input line, the leading sequence of non-whitespace characters will be
taken to be a reference that has the purpose of identifying this input
line in the produced permuted index. For more information about reference
production, see @ref{Output formatting in ptx}.
2787 Using this option changes the default value for option @samp{-S}.
2789 Using this option, the program does not try very hard to remove
2790 references from contexts in output, but it succeeds in doing so
2791 @emph{when} the context ends exactly at the newline. If option
2792 @samp{-r} is used with @samp{-S} default value, or when GNU extensions
2793 are disabled, this condition is always met and references are completely
2794 excluded from the output contexts.
2796 @item -S @var{regexp}
2797 @itemx --sentence-regexp=@var{regexp}
2799 This option selects which regular expression will describe the end of a
line or the end of a sentence. In fact, there are other distinctions
between ends of lines and ends of sentences besides the effect of this regular
expression, and input line boundaries have no special significance
2803 outside this option. By default, when GNU extensions are enabled and if
the @samp{-r} option is not used, ends of sentences are used. In this
case, the precise @var{regexp} is imported from GNU Emacs:
2808 [.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
2811 Whenever GNU extensions are disabled or if @samp{-r} option is used, end
2812 of lines are used; in this case, the default @var{regexp} is just:
2818 Using an empty @var{regexp} is equivalent to completely disabling end of
2819 line or end of sentence recognition. In this case, the whole file is
2820 considered to be a single big line or sentence. The user might want to
2821 disallow all truncation flag generation as well, through option @samp{-F
2822 ""}. @xref{Regexps, , Syntax of Regular Expressions, emacs, The GNU Emacs
2825 When the keywords happen to be near the beginning of the input line or
2826 sentence, this often creates an unused area at the beginning of the
2827 output context line; when the keywords happen to be near the end of the
2828 input line or sentence, this often creates an unused area at the end of
2829 the output context line. The program tries to fill those unused areas
2830 by wrapping around context in them; the tail of the input line or
2831 sentence is used to fill the unused area on the left of the output line;
2832 the head of the input line or sentence is used to fill the unused area
2833 on the right of the output line.
2835 As a matter of convenience to the user, many usual backslashed escape
2836 sequences, as found in the C language, are recognized and converted to
2837 the corresponding characters by @code{ptx} itself.
2839 @item -W @var{regexp}
2840 @itemx --word-regexp=@var{regexp}
2842 This option selects which regular expression will describe each keyword.
2843 By default, if GNU extensions are enabled, a word is a sequence of
2844 letters; the @var{regexp} used is @samp{\w+}. When GNU extensions are
2845 disabled, a word is by default anything which ends with a space, a tab
2846 or a newline; the @var{regexp} used is @samp{[^ \t\n]+}.
2848 An empty @var{regexp} is equivalent to not using this option, letting the
default apply. @xref{Regexps, , Syntax of Regular Expressions, emacs,
2850 The GNU Emacs Manual}.
2852 As a matter of convenience to the user, many usual backslashed escape
2853 sequences, as found in the C language, are recognized and converted to
2854 the corresponding characters by @code{ptx} itself.
2859 @node Output formatting in ptx
2860 @subsection Output formatting
2862 Output format is mainly controlled by @samp{-O} and @samp{-T} options,
2863 described in the table below. When neither @samp{-O} nor @samp{-T} is
selected, and if GNU extensions are enabled, the program chooses an
2865 output format suited for a dumb terminal. Each keyword occurrence is
2866 output to the center of one line, surrounded by its left and right
2867 contexts. Each field is properly justified, so the concordance output
can be read easily. As a special feature, if automatic
2869 references are selected by option @samp{-A} and are output before the
2870 left context, that is, if option @samp{-R} is @emph{not} selected, then
2871 a colon is added after the reference; this nicely interfaces with GNU
2872 Emacs @code{next-error} processing. In this default output format, each
2873 white space character, like newline and tab, is merely changed to
2874 exactly one space, with no special attempt to compress consecutive
2875 spaces. This might change in the future. Except for those white space
2876 characters, every other character of the underlying set of 256
2877 characters is transmitted verbatim.
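The automatic-reference behaviour of the default output format can be
sketched as follows, assuming a current GNU @code{ptx}. The file name
@file{notes.txt} and its contents are made up; only the presence of the
@samp{@var{file}:@var{line}:} prefix is relied on, not the exact
spacing of the fields.

```shell
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
printf 'First line.\nSecond line.\n' > notes.txt

# -A asks for automatic references (file name and line ordinal).
# Without -R they are placed before the left context and are followed
# by a colon.
with_refs=$(ptx -A notes.txt)

printf '%s\n' "$with_refs"
```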
2879 Output format is further controlled by the following options.
2883 @item -g @var{number}
2884 @itemx --gap-size=@var{number}
2886 Select the size of the minimum white gap between the fields on the output
2889 @item -w @var{number}
2890 @itemx --width=@var{number}
2892 Select the output maximum width of each final line. If references are
2893 used, they are included or excluded from the output maximum width
2894 depending on the value of option @samp{-R}. If this option is not
2895 selected, that is, when references are output before the left context,
2896 the output maximum width takes into account the maximum length of all
2897 references. If this option is selected, that is, when references are
2898 output after the right context, the output maximum width does not take
2899 into account the space taken by references, nor the gap that precedes
2903 @itemx --auto-reference
2905 Select automatic references. Each input line will have an automatic
2906 reference made up of the file name and the line ordinal, with a single
2907 colon between them. However, the file name will be empty when standard
2908 input is being read. If both @samp{-A} and @samp{-r} are selected, then
2909 the input reference is still read and skipped, but the automatic
2910 reference is used at output time, overriding the input reference.
2913 @itemx --right-side-refs
2915 In the default output format, when option @samp{-R} is not used, any
2916 references produced by the effect of options @samp{-r} or @samp{-A} are
2917 placed at the beginning of each output line, before the left context. In
2918 the default output format, when option @samp{-R} is specified, references
2919 are instead placed at the far right of output lines, after the right
2920 context. For any other output format, option @samp{-R} is almost
2921 ignored, except for the fact that the width of references is @emph{not}
2922 taken into account in total output width given by @samp{-w} whenever
2923 @samp{-R} is selected.
2925 This option is automatically selected whenever GNU extensions are
2928 @item -F @var{string}
2929 @itemx --flag-truncation=@var{string}
2931 This option will request that any truncation in the output be reported
2932 using the string @var{string}. Most output fields theoretically extend
2933 towards the beginning or the end of the current line, or current
2934 sentence, as selected with option @samp{-S}. But there is a maximum
2935 allowed output line width, changeable through option @samp{-w}, which is
2936 further divided into space for various output fields. When a field
2937 cannot extend to the beginning or the end of the current line and still
2938 fit in its allotted space, it is truncated. By default,
2939 the string used is a single slash, as in @samp{-F /}.
2941 @var{string} may have more than one character, as in @samp{-F ...}.
2942 Also, in the particular case @var{string} is empty (@samp{-F ""}),
2943 truncation flagging is disabled, and no truncation marks are appended in
2946 As a matter of convenience to the user, many usual backslashed escape
2947 sequences, as found in the C language, are recognized and converted to
2948 the corresponding characters by @code{ptx} itself.
2950 @item -M @var{string}
2951 @itemx --macro-name=@var{string}
2953 Select another @var{string} to be used instead of @samp{xx}, while
2954 generating output suitable for @code{nroff}, @code{troff} or @TeX{}.
2957 @itemx --format=roff
2959 Choose an output format suitable for @code{nroff} or @code{troff}
2960 processing. Each output line will look like:
2963 .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
2966 so it will be possible to write an @samp{.xx} roff macro to take care of
2967 the output typesetting. This is the default output format when GNU
2968 extensions are disabled. Option @samp{-M} might be used to change
2969 @samp{xx} to another macro name.
2971 In this output format, each non-graphical character, like newline and
2972 tab, is merely changed to exactly one space, with no special attempt to
2973 compress consecutive spaces. Each quote character: @kbd{"} is doubled
2974 so it will be correctly processed by @code{nroff} or @code{troff}.
2979 Choose an output format suitable for @TeX{} processing. Each output
2980 line will look like:
2983 \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
2987 so it will be possible to write a @code{\xx} definition to take care of
2988 the output typesetting. Note that when references are not being
2989 produced, that is, neither option @samp{-A} nor option @samp{-r} is
2990 selected, the last parameter of each @code{\xx} call is inhibited.
2991 Option @samp{-M} might be used to change @samp{xx} to another macro
2994 In this output format, some special characters, like @kbd{$}, @kbd{%},
2995 @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
2996 backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
2997 backslash, but also enclosed in a pair of dollar signs to force
2998 mathematical mode. The backslash itself produces the sequence
2999 @code{\backslash@{@}}. Circumflex and tilde diacritics produce the
3000 sequences @code{^\@{ @}} and @code{~\@{ @}} respectively. Other
3001 diacriticized characters of the underlying character set produce an
3002 appropriate @TeX{} sequence as far as possible. The other non-graphical
3003 characters, like newline and tab, and all other characters which are
3004 not part of @sc{ascii}, are merely changed to exactly one space, with no
3005 special attempt to compress consecutive spaces. Let me know how to
3006 improve this special character processing for @TeX{}.
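The two typesetter-oriented formats can be compared on a one-line input.
The following is only a sketch: the sample sentence and the macro name
@samp{yy} are made up for illustration, and it assumes a GNU @code{ptx}
reading standard input. It checks only how each output line starts,
since the exact field contents depend on the line width in effect.

```shell
# Hypothetical one-line input; each word becomes a keyword.
sample='a quick brown fox'

# -O: roff format, one ".xx" macro call per keyword occurrence.
roff=$(printf '%s\n' "$sample" | ptx -O)

# -T: TeX format, one "\xx" call per keyword occurrence.
tex=$(printf '%s\n' "$sample" | ptx -T)

# -M: use another macro name (here the made-up name "yy") instead of "xx".
renamed=$(printf '%s\n' "$sample" | ptx -O -M yy)

printf '%s\n' "$roff" "$tex" "$renamed"
```

Writing the @samp{.xx} or @code{\xx} definition that typesets these
calls is left to the user, as described above.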
3011 @node Compatibility in ptx
3012 @subsection The GNU extensions to @code{ptx}
3014 This version of @code{ptx} contains a few features which do not exist in
3015 System V @code{ptx}. These extra features are suppressed by using the
3016 @samp{-G} command line option, unless overridden by other command line
3017 options. Some GNU extensions cannot be recovered by overriding, so the
3018 simple rule is to avoid @samp{-G} if you care about GNU extensions.
3019 Here are the differences between this program and System V @code{ptx}.
3024 This program can read many input files at once, and it always writes the
3025 resulting concordance on standard output. On the other hand, System V
3026 @code{ptx} reads only one file and produces the result on standard output
3027 or, if a second @var{file} parameter is given on the command line, to that
3030 Having output parameters not introduced by options is quite a dangerous
3031 practice which GNU avoids as far as possible. So, for using @code{ptx}
3032 portably between GNU and System V, you should pay attention to always
3033 use it with a single input file, and always expect the result on
3034 standard output. You might also want to automatically configure in a
3035 @samp{-G} option to @code{ptx} calls in products using @code{ptx}, if
3036 the configurator finds that the installed @code{ptx} accepts @samp{-G}.
3039 The only options available in System V @code{ptx} are options @samp{-b},
3040 @samp{-f}, @samp{-g}, @samp{-i}, @samp{-o}, @samp{-r}, @samp{-t} and
3041 @samp{-w}. All other options are GNU extensions and are not repeated in
3042 this enumeration. Moreover, some options have a slightly different
3043 meaning when GNU extensions are enabled, as explained below.
3046 By default, concordance output is not formatted for @code{troff} or
3047 @code{nroff}. It is rather formatted for a dumb terminal. @code{troff}
3048 or @code{nroff} output may still be selected through option @samp{-O}.
3051 Unless @samp{-R} option is used, the maximum reference width is
3052 subtracted from the total output line width. With GNU extensions
3053 disabled, width of references is not taken into account in the output
3054 line width computations.
3057 All 256 characters, even @kbd{NUL}s, are always read and processed from
3058 input file with no adverse effect, even if GNU extensions are disabled.
3059 However, System V @code{ptx} does not accept 8-bit characters, a few
3060 control characters are rejected, and the tilde @kbd{~} is condemned.
3063 Input line length is only limited by available memory, even if GNU
3064 extensions are disabled. However, System V @code{ptx} processes only
3065 the first 200 characters in each line.
3068 The break (non-word) characters default to every character except the
3069 letters of the underlying character set, diacriticized or not. When GNU
3070 extensions are disabled, the break characters default to space, tab and
3074 The program makes better use of output line width. If GNU extensions
3075 are disabled, the program rather tries to imitate System V @code{ptx},
3076 but still, there are some slight disposition glitches this program does
3077 not completely reproduce.
3080 The user can specify both an Ignore file and an Only file. This is not
3081 allowed with System V @code{ptx}.
3086 @node Operating on fields within a line
3087 @chapter Operating on fields within a line
3090 * cut invocation:: Print selected parts of lines.
3091 * paste invocation:: Merge lines of files.
3092 * join invocation:: Join lines on a common field.
3096 @node cut invocation
3097 @section @code{cut}: Print selected parts of lines
3100 @code{cut} writes to standard output selected parts of each line of each
3101 input file, or standard input if no files are given or for a file name of
3105 cut [@var{option}]@dots{} [@var{file}]@dots{}
3108 In the table which follows, the @var{byte-list}, @var{character-list},
3109 and @var{field-list} are one or more numbers or ranges (two numbers
3110 separated by a dash) separated by commas. Bytes, characters, and
3111 fields are numbered starting at 1. Incomplete ranges may be
3112 given: @samp{-@var{m}} means @samp{1-@var{m}}; @samp{@var{n}-} means
3113 @samp{@var{n}} through end of line or last field.
3115 The program accepts the following options. Also see @ref{Common
3120 @item -b @var{byte-list}
3121 @itemx --bytes=@var{byte-list}
3124 Print only the bytes in positions listed in @var{byte-list}. Tabs and
3125 backspaces are treated like any other character; they take up 1 byte.
3127 @item -c @var{character-list}
3128 @itemx --characters=@var{character-list}
3130 @opindex --characters
3131 Print only characters in positions listed in @var{character-list}.
3132 The same as @samp{-b} for now, but internationalization will change
3133 that. Tabs and backspaces are treated like any other character; they
3134 take up 1 character.
3136 @item -f @var{field-list}
3137 @itemx --fields=@var{field-list}
3140 Print only the fields listed in @var{field-list}. Fields are
3141 separated by a TAB character by default.
3143 @item -d @var{input_delim_byte}
3144 @itemx --delimiter=@var{input_delim_byte}
3146 @opindex --delimiter
3147 For @samp{-f}, fields are separated in the input by the first character
3148 in @var{input_delim_byte} (default is TAB).
3152 Do not split multi-byte characters (no-op for now).
3155 @itemx --only-delimited
3157 @opindex --only-delimited
3158 For @samp{-f}, do not print lines that do not contain the field separator
3161 @itemx --output-delimiter=@var{output_delim_string}
3162 @opindex --output-delimiter
3163 For @samp{-f}, output fields are separated by @var{output_delim_string}.
3164 The default is to use the input delimiter.
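As an illustration of the list syntax and of the field options, here is
a small sketch. The colon-separated records are made-up sample data;
only options described above are used.

```shell
# Hypothetical sample records: three fields separated by colons,
# plus one line that contains no delimiter at all.
records='alice:x:1001
bob:x:1002
no delimiter here'

# Fields 1 and 3; a line without the delimiter is printed unchanged.
fields=$(printf '%s\n' "$records" | cut -d : -f 1,3)

# Same selection, but -s drops lines that lack the delimiter.
only_delimited=$(printf '%s\n' "$records" | cut -d : -f 1,3 -s)

# Incomplete ranges: "-3" means bytes 1-3, "5-" means byte 5 to end of line.
first_three=$(printf 'abcdefgh\n' | cut -b -3)
from_fifth=$(printf 'abcdefgh\n' | cut -b 5-)

printf '%s\n' "$fields" "$only_delimited" "$first_three" "$from_fifth"
```

Here @code{fields} keeps the undelimited line, @code{only_delimited}
drops it, and the two byte ranges yield @samp{abc} and @samp{efgh}.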
3170 @node paste invocation
3171 @section @code{paste}: Merge lines of files
3174 @cindex merging files
3176 @code{paste} writes to standard output lines consisting of sequentially
3177 corresponding lines of each given file, separated by a TAB character.
3178 Standard input is used for a file name of @samp{-} or if no input files
3184 paste [@var{option}]@dots{} [@var{file}]@dots{}
3187 The program accepts the following options. Also see @ref{Common options}.
3195 Paste the lines of one file at a time rather than one line from each
3198 @item -d @var{delim-list}
3199 @itemx --delimiters=@var{delim-list}
3201 @opindex --delimiters
3202 Consecutively use the characters in @var{delim-list} instead of
3203 TAB to separate merged lines. When @var{delim-list} is
3204 exhausted, start again at its beginning.
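The way @var{delim-list} is consumed and reused can be seen in a short
sketch. The three files are created in a temporary directory with
made-up contents; their names are arbitrary.

```shell
# Three hypothetical input files of three lines each.
tmpdir=$(mktemp -d)
printf '1\n2\n3\n' > "$tmpdir/a"
printf 'x\ny\nz\n' > "$tmpdir/b"
printf 'p\nq\nr\n' > "$tmpdir/c"

# With -d ',:' the first pair of columns is joined by ',' and the
# second pair by ':'.
merged=$(paste -d ',:' "$tmpdir/a" "$tmpdir/b" "$tmpdir/c")

# With -s the lines of one file are pasted onto a single line; when the
# delimiter list is exhausted, paste starts again at its beginning.
serial=$(printf '1\n2\n3\n4\n' | paste -s -d ',:' -)

printf '%s\n' "$merged" "$serial"
rm -rf "$tmpdir"
```

The first command yields @samp{1,x:p}, @samp{2,y:q} and @samp{3,z:r};
the second yields @samp{1,2:3,4}.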
3209 @node join invocation
3210 @section @code{join}: Join lines on a common field
3213 @cindex common field, joining on
3215 @code{join} writes to standard output a line for each pair of input
3216 lines that have identical join fields. Synopsis:
3219 join [@var{option}]@dots{} @var{file1} @var{file2}
3223 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
3224 meaning standard input. @var{file1} and @var{file2} should be already
3225 sorted in increasing textual order on the join fields, using the
3226 collating sequence specified by the @env{LC_COLLATE} locale. Unless
3227 the @samp{-t} option is given, the input should be sorted ignoring blanks at
3228 the start of the join field, as in @code{sort -b}. If the
3229 @samp{--ignore-case} option is given, lines should be sorted without
3230 regard to the case of characters in the join field, as in @code{sort -f}.
3232 The defaults are: the join field is the first field in each line;
3233 fields in the input are separated by one or more blanks, with leading
3234 blanks on the line ignored; fields in the output are separated by a
3235 space; each output line consists of the join field, the remaining
3236 fields from @var{file1}, then the remaining fields from @var{file2}.
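The defaults can be checked with two small files that are already
sorted on their first field. This is a sketch only: the file names and
contents are invented, and the locale is forced to @samp{C} so that the
collating order is predictable.

```shell
LC_ALL=C
export LC_ALL

# Two hypothetical files, both sorted on the first (join) field.
tmpdir=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$tmpdir/names"
printf '1 red\n3 dark-red\n4 purple\n' > "$tmpdir/colors"

# Default join: only keys present in both files produce output.
# Each line is the join field, then the rest of file1, then the rest of file2.
joined=$(join "$tmpdir/names" "$tmpdir/colors")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Keys @samp{2} and @samp{4} appear in only one file, so they produce no
output line.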
3238 The program accepts the following options. Also see @ref{Common options}.
3242 @item -a @var{file-number}
3244 Print a line for each unpairable line in file @var{file-number} (either
3245 @samp{1} or @samp{2}), in addition to the normal output.
3247 @item -e @var{string}
3249 Replace those output fields that are missing in the input with
3253 @itemx --ignore-case
3255 @opindex --ignore-case
3256 Ignore differences in case when comparing keys.
3257 With this option, the lines of the input files must be ordered in the same way.
3258 Use @samp{sort -f} to produce this ordering.
3260 @item -1 @var{field}
3261 @itemx -j1 @var{field}
3264 Join on field @var{field} (a positive integer) of file 1.
3266 @item -2 @var{field}
3267 @itemx -j2 @var{field}
3270 Join on field @var{field} (a positive integer) of file 2.
3272 @item -j @var{field}
3273 Equivalent to @samp{-1 @var{field} -2 @var{field}}.
3275 @item -o @var{field-list}@dots{}
3276 Construct each output line according to the format in @var{field-list}.
3277 Each element in @var{field-list} is either the single character @samp{0} or
3278 has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
3279 @samp{2} and @var{n} is a positive field number.
3281 A field specification of @samp{0} denotes the join field.
3282 In most cases, the functionality of the @samp{0} field spec
3283 may be reproduced using the explicit @var{m.n} that corresponds
3284 to the join field. However, when printing unpairable lines
3285 (using either of the @samp{-a} or @samp{-v} options), there is no way
3286 to specify the join field using @var{m.n} in @var{field-list}
3287 if there are unpairable lines in both files.
3288 To give @code{join} that functionality, @sc{posix} invented the @samp{0}
3289 field specification notation.
3291 The elements in @var{field-list}
3292 are separated by commas or blanks. Multiple @var{field-list}
3293 arguments can be given after a single @samp{-o} option; the values
3294 of all lists given with @samp{-o} are concatenated together.
3295 All output lines, including those printed because of any @samp{-a} or @samp{-v}
3296 option, are subject to the specified @var{field-list}.
3299 Use character @var{char} as the input and output field separator.
3301 @item -v @var{file-number}
3302 Print a line for each unpairable line in file @var{file-number}
3303 (either @samp{1} or @samp{2}), instead of the normal output.
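The interaction of @samp{-a}, @samp{-e}, @samp{-o} and @samp{-v} is
easier to see on a concrete case. The sketch below uses invented file
names and data; the placeholder string @samp{NONE} is arbitrary.

```shell
LC_ALL=C
export LC_ALL

# Two hypothetical sorted files; keys 2 and 4 are unpairable.
tmpdir=$(mktemp -d)
printf '1 apple\n2 banana\n3 cherry\n' > "$tmpdir/names"
printf '1 red\n3 dark-red\n4 purple\n' > "$tmpdir/colors"

# Print paired lines plus unpairable lines from both files.
# "0" is the join field, so it is filled in whichever file the line
# came from; -e supplies a placeholder for the missing fields.
full=$(join -a 1 -a 2 -e NONE -o 0,1.2,2.2 "$tmpdir/names" "$tmpdir/colors")

# Print only the unpairable lines of file 1, instead of the normal output.
only_first=$(join -v 1 "$tmpdir/names" "$tmpdir/colors")

printf '%s\n' "$full" "$only_first"
rm -rf "$tmpdir"
```

With @samp{1.1} in place of @samp{0}, the line for key @samp{4} would
start with the placeholder rather than with the key, which is the
situation the @samp{0} notation was introduced to handle.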
3307 In addition, when GNU @code{join} is invoked with exactly one argument,
3308 options @samp{--help} and @samp{--version} are recognized. @xref{Common
3312 @node Operating on characters
3313 @chapter Operating on characters
3315 @cindex operating on characters
3317 These commands operate on individual characters.
3320 * tr invocation:: Translate, squeeze, and/or delete characters.
3321 * expand invocation:: Convert tabs to spaces.
3322 * unexpand invocation:: Convert spaces to tabs.
3327 @section @code{tr}: Translate, squeeze, and/or delete characters
3334 tr [@var{option}]@dots{} @var{set1} [@var{set2}]
3337 @code{tr} copies standard input to standard output, performing
3338 one of the following operations:
3342 translate, and optionally squeeze repeated characters in the result,
3344 squeeze repeated characters,
3348 delete characters, then squeeze repeated characters from the result.
3351 The @var{set1} and (if given) @var{set2} arguments define ordered
3352 sets of characters, referred to below as @var{set1} and @var{set2}. These
3353 sets are the characters of the input that @code{tr} operates on.
3354 The @samp{--complement} (@samp{-c}) option replaces @var{set1} with its
3355 complement (all of the characters that are not in @var{set1}).
3358 * Character sets:: Specifying sets of characters.
3359 * Translating:: Changing one character to another.
3360 * Squeezing:: Squeezing repeats and deleting.
3361 * Warnings in tr:: Warning messages.
3365 @node Character sets
3366 @subsection Specifying sets of characters
3368 @cindex specifying sets of characters
3370 The format of the @var{set1} and @var{set2} arguments resembles
3371 the format of regular expressions; however, they are not regular
3372 expressions, only lists of characters. Most characters simply
3373 represent themselves in these strings, but the strings can contain
3374 the shorthands listed below, for convenience. Some of them can be
3375 used only in @var{set1} or @var{set2}, as noted below.
3379 @item Backslash escapes
3380 @cindex backslash escapes
3382 A backslash followed by a character not listed below causes an error
3401 The character with the value given by @var{ooo}, which is 1 to 3
3410 The notation @samp{@var{m}-@var{n}} expands to all of the characters
3411 from @var{m} through @var{n}, in ascending order. @var{m} should
3412 collate before @var{n}; if it doesn't, an error results. As an example,
3413 @samp{0-9} is the same as @samp{0123456789}. Although GNU @code{tr}
3414 does not support the System V syntax that uses square brackets to
3415 enclose ranges, translations specified in that format will still work as
3416 long as the brackets in @var{set1} correspond to identical brackets
3419 @item Repeated characters
3420 @cindex repeated characters
3422 The notation @samp{[@var{c}*@var{n}]} in @var{set2} expands to @var{n}
3423 copies of character @var{c}. Thus, @samp{[y*6]} is the same as
3424 @samp{yyyyyy}. The notation @samp{[@var{c}*]} in @var{set2} expands
3425 to as many copies of @var{c} as are needed to make @var{set2} as long as
3426 @var{set1}. If @var{n} begins with @samp{0}, it is interpreted in
3427 octal, otherwise in decimal.
3429 @item Character classes
3430 @cindex character classes
3432 The notation @samp{[:@var{class}:]} expands to all of the characters in
3433 the (predefined) class @var{class}. The characters expand in no
3434 particular order, except for the @code{upper} and @code{lower} classes,
3435 which expand in ascending order. When the @samp{--delete} (@samp{-d})
3436 and @samp{--squeeze-repeats} (@samp{-s}) options are both given, any
3437 character class can be used in @var{set2}. Otherwise, only the
3438 character classes @code{lower} and @code{upper} are accepted in
3439 @var{set2}, and then only if the corresponding character class
3440 (@code{upper} and @code{lower}, respectively) is specified in the same
3441 relative position in @var{set1}. Doing this specifies case conversion.
3442 The class names are given below; an error results when an invalid class
3454 Horizontal whitespace.
3463 Printable characters, not including space.
3469 Printable characters, including space.
3472 Punctuation characters.
3475 Horizontal or vertical whitespace.
3484 @item Equivalence classes
3485 @cindex equivalence classes
3487 The syntax @samp{[=@var{c}=]} expands to all of the characters that are
3488 equivalent to @var{c}, in no particular order. Equivalence classes are
3489 a relatively recent invention intended to support non-English alphabets.
3490 But there seems to be no standard way to define them or determine their
3491 contents. Therefore, they are not fully implemented in GNU @code{tr};
3492 each character's equivalence class consists only of that character,
3493 which is of no particular use.
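The notations described in this section can be tried on short strings.
The inputs below are arbitrary, and the sketch assumes GNU @code{tr};
the sets are quoted so that the shell does not interpret the brackets
or the backslash.

```shell
# Range: a-c is the same as abc.
range=$(printf 'aabbccdd\n' | tr a-c A-C)

# Repeat with an explicit count: [.*3] is the same as "...".
counted=$(printf 'abc\n' | tr abc '[.*3]')

# Repeat without a count: [x*] pads set2 to the length of set1.
padded=$(printf 'hello world\n' | tr a-z '[x*]')

# Character classes in matching positions specify case conversion.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')

# Octal escape: \012 is the newline character.
lines=$(printf 'a b c\n' | tr ' ' '\012')

printf '%s\n' "$range" "$counted" "$padded" "$upper" "$lines"
```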
3499 @subsection Translating
3501 @cindex translating characters
3503 @code{tr} performs translation when @var{set1} and @var{set2} are
3504 both given and the @samp{--delete} (@samp{-d}) option is not given.
3505 @code{tr} translates each character of its input that is in @var{set1}
3506 to the corresponding character in @var{set2}. Characters not in
3507 @var{set1} are passed through unchanged. When a character appears more
3508 than once in @var{set1} and the corresponding characters in @var{set2}
3509 are not all the same, only the final one is used. For example, these
3510 two commands are equivalent:
3517 A common use of @code{tr} is to convert lowercase characters to
3518 uppercase. This can be done in many ways. Here are three of them:
3521 tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
3523 tr '[:lower:]' '[:upper:]'
3526 When @code{tr} is performing translation, @var{set1} and @var{set2}
3527 typically have the same length. If @var{set1} is shorter than
3528 @var{set2}, the extra characters at the end of @var{set2} are ignored.
3530 On the other hand, making @var{set1} longer than @var{set2} is not
3531 portable; @sc{posix.2} says that the result is undefined. In this situation,
3532 BSD @code{tr} pads @var{set2} to the length of @var{set1} by repeating
3533 the last character of @var{set2} as many times as necessary. System V
3534 @code{tr} truncates @var{set1} to the length of @var{set2}.
3536 By default, GNU @code{tr} handles this case like BSD @code{tr}. When
3537 the @samp{--truncate-set1} (@samp{-t}) option is given, GNU @code{tr}
3538 handles this case like the System V @code{tr} instead. This option is
3539 ignored for operations other than translation.
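The difference between the two behaviors shows up as soon as @var{set1}
is longer than @var{set2}. This sketch assumes GNU @code{tr}, since
@samp{-t} is not available everywhere; the input strings are arbitrary.

```shell
# Default (BSD-like): set2 "xy" is padded with its last character,
# so c, d and e all map to y.
padded=$(printf 'abcde\n' | tr abcde xy)

# With -t (System V-like): set1 is truncated to "ab",
# so c, d and e pass through unchanged.
truncated=$(printf 'abcde\n' | tr -t abcde xy)

# A character repeated in set1: only the final mapping is used.
last_wins=$(printf 'banana\n' | tr aaa xyz)

printf '%s\n' "$padded" "$truncated" "$last_wins"
```

The three results are @samp{xyyyy}, @samp{xycde} and @samp{bznznz}.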
3541 Acting like System V @code{tr} in this case breaks the relatively common
3545 tr -cs A-Za-z0-9 '\012'
3549 because it converts only zero bytes (the first element in the
3550 complement of @var{set1}), rather than all non-alphanumerics, to
3555 @subsection Squeezing repeats and deleting
3557 @cindex squeezing repeat characters
3558 @cindex deleting characters
3560 When given just the @samp{--delete} (@samp{-d}) option, @code{tr}
3561 removes any input characters that are in @var{set1}.
3563 When given just the @samp{--squeeze-repeats} (@samp{-s}) option,
3564 @code{tr} replaces each input sequence of a repeated character that
3565 is in @var{set1} with a single occurrence of that character.
3567 When given both @samp{--delete} and @samp{--squeeze-repeats}, @code{tr}
3568 first performs any deletions using @var{set1}, then squeezes repeats
3569 from any remaining characters using @var{set2}.
3571 The @samp{--squeeze-repeats} option may also be used when translating,
3572 in which case @code{tr} first performs translation, then squeezes
3573 repeats from any remaining characters using @var{set2}.
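The four cases described above can be compared side by side on
arbitrary sample strings. Note which set each step uses.

```shell
# -d alone: delete the characters of set1.
deleted=$(printf 'aabbccdd\n' | tr -d b)

# -s alone: squeeze repeats of the characters of set1.
squeezed=$(printf 'aabbccdd\n' | tr -s a)

# -d and -s together: delete set1 (b), then squeeze set2 (c).
both=$(printf 'aabbccdd\n' | tr -d -s b c)

# -s while translating: translate spaces to newlines, then squeeze
# the repeated newlines (the characters of set2).
words=$(printf 'one  two   three\n' | tr -s ' ' '\n')

printf '%s\n' "$deleted" "$squeezed" "$both" "$words"
```

The first three results are @samp{aaccdd}, @samp{abbccdd} and
@samp{aacdd}; the last puts each word on its own line.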
3575 Here are some examples to illustrate various combinations of options:
3580 Remove all zero bytes:
3587 Put all words on lines by themselves. This converts all
3588 non-alphanumeric characters to newlines, then squeezes each string
3589 of repeated newlines into a single newline:
3592 tr -cs 'a-zA-Z0-9' '[\n*]'
3596 Convert each sequence of repeated newlines to a single newline:
3603 Find doubled occurrences of words in a document.
3604 For example, people often write ``the the'' with the duplicated words
3605 separated by a newline. The Bourne shell script below works first
3606 by converting each sequence of punctuation and blank characters to a
3607 single newline. That puts each ``word'' on a line by itself.
3608 Next it maps all uppercase characters to lower case, and finally it
3609 runs @code{uniq} with the @samp{-d} option to print out only the words
3610 that were adjacent duplicates.
3615 | tr -s '[:punct:][:blank:]' '\n' \
3616 | tr '[:upper:]' '[:lower:]' \
3623 @node Warnings in tr
3624 @subsection Warning messages
3626 @vindex POSIXLY_CORRECT
3627 Setting the environment variable @env{POSIXLY_CORRECT} turns off the
3628 following warning and error messages, for strict compliance with
3629 @sc{posix.2}. Otherwise, the following diagnostics are issued:
3634 When the @samp{--delete} option is given but @samp{--squeeze-repeats}
3635 is not, and @var{set2} is given, GNU @code{tr} by default prints
3636 a usage message and exits, because @var{set2} would not be used.
3637 The @sc{posix} specification says that @var{set2} must be ignored in
3638 this case. Silently ignoring arguments is a bad idea.
3641 When an ambiguous octal escape is given. For example, @samp{\400}
3642 is actually @samp{\40} followed by the digit @samp{0}, because the
3643 value 400 octal does not fit into a single byte.
3647 GNU @code{tr} does not provide complete BSD or System V compatibility.
3648 For example, it is impossible to disable interpretation of the @sc{posix}
3649 constructs @samp{[:alpha:]}, @samp{[=c=]}, and @samp{[c*10]}. Also, GNU
3650 @code{tr} does not delete zero bytes automatically, unlike traditional
3651 Unix versions, which provide no way to preserve zero bytes.
3654 @node expand invocation
3655 @section @code{expand}: Convert tabs to spaces
3658 @cindex tabs to spaces, converting
3659 @cindex converting tabs to spaces
3661 @code{expand} writes the contents of each given @var{file}, or standard
3662 input if none are given or for a @var{file} of @samp{-}, to standard
3663 output, with tab characters converted to the appropriate number of
3667 expand [@var{option}]@dots{} [@var{file}]@dots{}
3670 By default, @code{expand} converts all tabs to spaces. It preserves
3671 backspace characters in the output; they decrement the column count for
3672 tab calculations. The default action is equivalent to @samp{-8} (set
3673 tabs every 8 columns).
3675 The program accepts the following options. Also see @ref{Common options}.
3679 @item -@var{tab1}[,@var{tab2}]@dots{}
3680 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3681 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3685 @cindex tabstops, setting
3686 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3687 (default is 8). Otherwise, set the tabs at columns @var{tab1},
3688 @var{tab2}, @dots{} (numbered from 0), and replace any tabs beyond the
3689 last tabstop given with single spaces. If the tabstops are specified
3690 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3691 blanks as well as by commas.
3697 @cindex initial tabs, converting
3698 Only convert initial tabs (those that precede all non-space or non-tab
3699 characters) on each line to spaces.
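The effect of the tab stop options is easiest to see by comparing
outputs. The inputs below are arbitrary, and the sketch assumes GNU
@code{expand}.

```shell
# Default: tab stops every 8 columns, so "a" is followed by 7 spaces.
default=$(printf 'a\tb\n' | expand)

# One tab stop given: stops every 4 columns, so 3 spaces follow "a".
four=$(printf 'a\tb\n' | expand -t 4)

# A list of stops at columns 2 and 6; the tab after "c" lies beyond
# the last stop and becomes a single space.
listed=$(printf 'a\tb\tc\td\n' | expand -t 2,6)

# -i: only the initial tab is converted; the later one is kept.
initial=$(printf '\ta\tb\n' | expand -i -t 4)

printf '%s\n' "$default" "$four" "$listed" "$initial"
```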
3704 @node unexpand invocation
3705 @section @code{unexpand}: Convert spaces to tabs
3709 @code{unexpand} writes the contents of each given @var{file}, or
3710 standard input if none are given or for a @var{file} of @samp{-}, to
3711 standard output, with strings of two or more space or tab characters
3712 converted to as many tabs as possible followed by as many spaces as are
3716 unexpand [@var{option}]@dots{} [@var{file}]@dots{}
3719 By default, @code{unexpand} converts only initial spaces and tabs (those
3720 that precede all non-space, non-tab characters) on each line. It
3721 preserves backspace characters in the output; they decrement the column
3722 count for tab calculations. By default, tabs are set at every 8th
3725 The program accepts the following options. Also see @ref{Common options}.
3729 @item -@var{tab1}[,@var{tab2}]@dots{}
3730 @itemx -t @var{tab1}[,@var{tab2}]@dots{}
3731 @itemx --tabs=@var{tab1}[,@var{tab2}]@dots{}
3735 If only one tab stop is given, set the tabs @var{tab1} spaces apart
3736 instead of the default 8. Otherwise, set the tabs at columns
3737 @var{tab1}, @var{tab2}, @dots{} (numbered from 0), and leave spaces and
3738 tabs beyond the tabstops given unchanged. If the tabstops are specified
3739 with the @samp{-t} or @samp{--tabs} option, they can be separated by
3740 blanks as well as by commas. This option implies the @samp{-a} option.
3746 Convert all strings of two or more spaces or tabs, not just initial
3753 @node Opening the software toolbox
3754 @chapter Opening the software toolbox
3756 This chapter originally appeared in @cite{Linux Journal}, volume 1,
3757 number 2, in the @cite{What's GNU?} column. It was written by Arnold
3761 * Toolbox introduction:: Toolbox introduction
3762 * I/O redirection:: I/O redirection
3763 * The who command:: The @code{who} command
3764 * The cut command:: The @code{cut} command
3765 * The sort command:: The @code{sort} command
3766 * The uniq command:: The @code{uniq} command
3767 * Putting the tools together:: Putting the tools together
3771 @node Toolbox introduction
3772 @unnumberedsec Toolbox introduction
3774 This month's column is only peripherally related to the GNU Project, in
3775 that it describes a number of the GNU tools on your Linux system and how they
3776 might be used. What it's really about is the ``Software Tools'' philosophy
3777 of program development and usage.
3779 The software tools philosophy was an important and integral concept
3780 in the initial design and development of Unix (of which Linux and GNU are
3781 essentially clones). Unfortunately, in the modern day press of
3782 Internetworking and flashy GUIs, it seems to have fallen by the
3783 wayside. This is a shame, since it provides a powerful mental model
3784 for solving many kinds of problems.
3786 Many people carry a Swiss Army knife around in their pants pockets (or
3787 purse). A Swiss Army knife is a handy tool to have: it has several knife
3788 blades, a screwdriver, tweezers, toothpick, nail file, corkscrew, and perhaps
3789 a number of other things on it. For the everyday, small miscellaneous jobs
3790 where you need a simple, general purpose tool, it's just the thing.
3792 On the other hand, an experienced carpenter doesn't build a house using
3793 a Swiss Army knife. Instead, he has a toolbox chock full of specialized
3794 tools---a saw, a hammer, a screwdriver, a plane, and so on. And he knows
3795 exactly when and where to use each tool; you won't catch him hammering nails
3796 with the handle of his screwdriver.
3798 The Unix developers at Bell Labs were all professional programmers and trained
3799 computer scientists. They had found that while a one-size-fits-all program
3800 might appeal to a user because there's only one program to use, in practice
3808 difficult to maintain and
3812 difficult to extend to meet new situations.
3815 Instead, they felt that programs should be specialized tools. In short, each
3816 program ``should do one thing well.'' No more and no less. Such programs are
3817 simpler to design, write, and get right---they only do one thing.
3819 Furthermore, they found that with the right machinery for hooking programs
3820 together, that the whole was greater than the sum of the parts. By combining
3821 several special purpose programs, you could accomplish a specific task
3822 that none of the programs was designed for, and accomplish it much more
3823 quickly and easily than if you had to write a special purpose program.
3824 We will see some (classic) examples of this further on in the column.
3825 (An important additional point was to take a detour, if necessary,
3826 and build any software tools you may need first, if you don't already
3827 have something appropriate in the toolbox.)
3829 @node I/O redirection
3830 @unnumberedsec I/O redirection
3832 Hopefully, you are familiar with the basics of I/O redirection in the
3833 shell, in particular the concepts of ``standard input,'' ``standard output,''
3834 and ``standard error''. Briefly, ``standard input'' is a data source, where
3835 data comes from. A program should not need to either know or care if the
3836 data source is a disk file, a keyboard, a magnetic tape, or even a punched
3837 card reader. Similarly, ``standard output'' is a data sink, where data goes
3838 to. The program should neither know nor care where this might be.
3839 Programs that only read their standard input, do something to the data,
3840 and then send it on, are called ``filters'', by analogy to filters in a
3843 With the Unix shell, it's very easy to set up data pipelines:
3846 program_to_create_data | filter1 | .... | filterN > final.pretty.data
3849 We start out by creating the raw data; each filter applies some successive
3850 transformation to the data, until by the time it comes out of the pipeline,
3851 it is in the desired form.
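
Here is a minimal sketch of such a pipeline that you can run as is. The
function @code{generate_data} is a made-up stand-in for
@code{program_to_create_data}, and the temporary file plays the role of
@file{final.pretty.data}; neither name comes from a real program.

```shell
# generate_data is a hypothetical stand-in for program_to_create_data.
generate_data() {
  printf '%s\n' 'pear 3' 'apple 10' 'fig 7'
}

# Two filters in a row: map lower case to upper case, then sort the lines.
# The final redirection sends the finished data to a file.
out=$(mktemp)
generate_data | tr 'a-z' 'A-Z' | sort > "$out"

# Read the result back, clean up, and show it.
result=$(cat "$out")
rm -f "$out"
printf '%s\n' "$result"
```

Each stage only reads standard input and writes standard output, so stages
can be added or removed without changing the others.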
3853 This is fine and good for standard input and standard output. Where does the
3854 standard error come into play? Well, think about @code{filter1} in
3855 the pipeline above. What happens if it encounters an error in the data it
3856 sees? If it writes an error message to standard output, it will just
3857 disappear down the pipeline into @code{filter2}'s input, and the
3858 user will probably never see it. So programs need a place where they can send
3859 error messages so that the user will notice them. This is standard error,
3860 and it is usually connected to your console or window, even if you have
3861 redirected standard output of your program away from your screen.
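
The difference is easy to demonstrate. In this sketch, @code{filter1} is a
hypothetical shell function (not a real utility) that passes non-empty lines
along and complains about empty ones on standard error.

```shell
# filter1 is a made-up filter: good lines go to standard output,
# complaints about bad (empty) lines go to standard error.
filter1() {
  while IFS= read -r line; do
    if [ -n "$line" ]; then
      printf '%s\n' "$line"
    else
      echo 'filter1: empty line skipped' >&2
    fi
  done
}

# Only standard output travels down the pipe to wc. The complaint still
# reaches the terminal, and the count covers just the two good lines.
good=$(printf 'a\n\nb\n' | filter1 | wc -l)
echo "$good"

# With 2>&1 the complaint is sent down the pipe as well, which is what
# would happen if filter1 wrote its errors to standard output: the
# message is silently counted as a third line of data.
mixed=$(printf 'a\n\nb\n' | filter1 2>&1 | wc -l)
echo "$mixed"
```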
3863 For filter programs to work together, the format of the data has to be
3864 agreed upon. The most straightforward and easiest format to use is simply
3865 lines of text. Unix data files are generally just streams of bytes, with
3866 lines delimited by the @sc{ascii} @sc{lf} (Line Feed) character,
3867 conventionally called a ``newline'' in the Unix literature. (This is
3868 @code{'\n'} if you're a C programmer.) This is the format used by all
3869 the traditional filtering programs. (Many earlier operating systems
3870 had elaborate facilities and special purpose programs for managing
3871 binary data. Unix has always shied away from such things, under the
3872 philosophy that it's easiest to simply be able to view and edit your
3873 data with a text editor.)
3875 OK, enough introduction. Let's take a look at some of the tools, and then
3876 we'll see how to hook them together in interesting ways. In the following
3877 discussion, we will only present those command line options that interest
3878 us. As you should always do, double check your system documentation
3881 @node The who command
3882 @unnumberedsec The @code{who} command
3884 The first program is the @code{who} command. By itself, it generates a
3885 list of the users who are currently logged in. Although I'm writing
3886 this on a single-user system, we'll pretend that several people are
3891 arnold console Jan 22 19:57
3892 miriam ttyp0 Jan 23 14:19(:0.0)
3893 bill ttyp1 Jan 21 09:32(:0.0)
3894 arnold ttyp2 Jan 23 20:48(:0.0)
3897 Here, the @samp{$} is the usual shell prompt, at which I typed @code{who}.
3898 There are three people logged in, and I am logged in twice. On traditional
3899 Unix systems, user names are never more than eight characters long. This
3900 little bit of trivia will be useful later. The output of @code{who} is nice,
3901 but the data is not all that exciting.
3903 @node The cut command
3904 @unnumberedsec The @code{cut} command
3906 The next program we'll look at is the @code{cut} command. This program
3907 cuts out columns or fields of input data. For example, we can tell it
3908 to print just the login name and full name from the @file{/etc/passwd}
3909 file. The @file{/etc/passwd} file has seven fields, separated by
3913 arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
3916 To get the first and fifth fields, we would use @code{cut} like this:
3919 $ cut -d: -f1,5 /etc/passwd
3922 arnold:Arnold D. Robbins
3923 miriam:Miriam A. Robbins
3927 With the @samp{-c} option, @code{cut} will cut out specific characters
3928 (i.e., columns) in the input lines. This command looks like it might be
3929 useful for data filtering.
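
To try both forms without touching the real @file{/etc/passwd}, here is a
small sketch that feeds @code{cut} two sample records from a shell variable.
The second record is invented for illustration.

```shell
# Two passwd-style sample records; the second one is made up.
passwd_sample='arnold:xyzzy:2076:10:Arnold D. Robbins:/home/arnold:/bin/ksh
miriam:yxaay:112:10:Miriam A. Robbins:/home/miriam:/bin/sh'

# Fields 1 and 5, using ":" as the field delimiter.
names=$(printf '%s\n' "$passwd_sample" | cut -d: -f1,5)
printf '%s\n' "$names"

# Characters (columns) 1 through 6 of each line.
cols=$(printf '%s\n' "$passwd_sample" | cut -c1-6)
printf '%s\n' "$cols"
```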
3932 @node The sort command
3933 @unnumberedsec The @code{sort} command
3935 Next we'll look at the @code{sort} command. This is one of the most
3936 powerful commands on a Unix-style system; one that you will often find
3937 yourself using when setting up fancy data plumbing. The @code{sort}
3938 command reads and sorts each file named on the command line. It then
3939 merges the sorted data and writes it to standard output. It will read
3940 standard input if no files are given on the command line (thus
3941 making it into a filter). The sort is based on the character collating
3942 sequence or based on user-supplied ordering criteria.
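
As a small sketch of the two kinds of ordering, the following sorts some
made-up sample lines first by the collating sequence and then by a
user-supplied criterion, here the second field compared numerically
with the @samp{-k} option.

```shell
# Made-up sample data: a name and a number on each line.
data='pear 3
apple 10
fig 7'

# Default: order by the collating sequence of the whole line.
by_name=$(printf '%s\n' "$data" | sort)

# User-supplied criterion: order by field 2, compared as a number.
by_count=$(printf '%s\n' "$data" | sort -k2,2n)

printf '%s\n' "$by_name" '--' "$by_count"
```

Note that a plain textual sort would put @samp{10} before @samp{3}; the
numeric comparison is what makes the second ordering come out right.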
3945 @node The uniq command
3946 @unnumberedsec The @code{uniq} command
3948 Finally (at least for now), we'll look at the @code{uniq} program. When
3949 sorting data, you will often end up with duplicate lines, lines that
3950 are identical. Usually, all you need is one instance of each line.
3951 This is where @code{uniq} comes in. The @code{uniq} program reads its
3952 standard input, which it expects to be sorted. It only prints out one
3953 copy of each duplicated line. It does have several options. Later on,
3954 we'll use the @samp{-c} option, which prints each unique line, preceded
3955 by a count of the number of times that line occurred in the input.
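
A quick sketch with made-up, already sorted input shows both behaviors.
The exact width of the count column printed by @samp{-c} varies between
implementations, so don't rely on it.

```shell
# Sample input, already sorted, with duplicates.
sorted='apple
apple
fig
pear
pear
pear'

# One copy of each line.
once=$(printf '%s\n' "$sorted" | uniq)
printf '%s\n' "$once"

# One copy of each line, preceded by the number of occurrences.
counted=$(printf '%s\n' "$sorted" | uniq -c)
printf '%s\n' "$counted"
```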
3958 @node Putting the tools together
3959 @unnumberedsec Putting the tools together
3961 Now, let's suppose this is a large BBS system with dozens of users
3962 logged in. The management wants the SysOp to write a program that will
3963 generate a sorted list of logged in users. Furthermore, even if a user
3964 is logged in multiple times, his or her name should only show up in the
3967 The SysOp could sit down with the system documentation and write a C
3968 program that did this. It would take perhaps a couple of hundred lines
3969 of code and about two hours to write it, test it, and debug it.
3970 However, knowing the software toolbox, the SysOp can instead start out
3971 by generating just a list of logged on users:
3981 Next, sort the list:
3984 $ who | cut -c1-8 | sort
3991 Finally, run the sorted list through @code{uniq}, to weed out duplicates:
3994 $ who | cut -c1-8 | sort | uniq
4000 The @code{sort} command actually has a @samp{-u} option that does what
4001 @code{uniq} does. However, @code{uniq} has other uses for which one
4002 cannot substitute @samp{sort -u}.
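
To make the comparison concrete, here is a sketch using a made-up list of
login names. It shows the two equivalent ways of removing duplicates, and
one use of @code{uniq} that @samp{sort -u} cannot imitate: the @samp{-d}
option, which reports only the lines that were repeated.

```shell
# Made-up login names, as "who | cut -c1-8" might produce them.
logins='arnold
miriam
bill
arnold'

# These two pipelines produce the same list.
with_uniq=$(printf '%s\n' "$logins" | sort | uniq)
with_sort_u=$(printf '%s\n' "$logins" | sort -u)

# uniq -d prints one copy of each line that occurred more than once,
# i.e., the users who are logged in multiple times.
twice=$(printf '%s\n' "$logins" | sort | uniq -d)

printf '%s\n' "$with_uniq" '--' "$twice"
```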
4004 The SysOp puts this pipeline into a shell script, and makes it available for
4005 all the users on the system:
4008 # cat > /usr/local/bin/listusers
4009 who | cut -c1-8 | sort | uniq
4011 # chmod +x /usr/local/bin/listusers
4014 There are four major points to note here. First, with just four
4015 programs, on one command line, the SysOp was able to save about two
4016 hours' worth of work. Furthermore, the shell pipeline is just about as
4017 efficient as the C program would be, and it is much more efficient in
4018 terms of programmer time. People time is much more expensive than
4019 computer time, and in our modern ``there's never enough time to do
4020 everything'' society, saving two hours of programmer time is no mean
4023 Second, it is also important to emphasize that with the
4024 @emph{combination} of the tools, it is possible to do a special
4025 purpose job never imagined by the authors of the individual programs.
4027 Third, it is also valuable to build up your pipeline in stages, as we did here.
4028 This allows you to view the data at each stage in the pipeline, which helps
4029 you acquire the confidence that you are indeed using these tools correctly.
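
Here is a sketch of that stage-by-stage approach. So that it runs
anywhere, it uses @code{sample_who}, a hypothetical shell function that
prints made-up @code{who}-style lines, in place of the real command.

```shell
# sample_who is a made-up stand-in for who, with the login name in
# columns 1-8 as on a traditional system.
sample_who() {
  printf '%s\n' \
    'arnold   console  Jan 22 19:57' \
    'miriam   ttyp0    Jan 23 14:19' \
    'bill     ttyp1    Jan 21 09:32' \
    'arnold   ttyp2    Jan 23 20:48'
}

stage1=$(sample_who | cut -c1-8)                 # login names only
stage2=$(sample_who | cut -c1-8 | sort)          # ... in sorted order
stage3=$(sample_who | cut -c1-8 | sort | uniq)   # ... without duplicates

# Look at the data after each stage before adding the next one.
printf '%s\n' '-- stage 1' "$stage1" '-- stage 2' "$stage2" \
  '-- stage 3' "$stage3"
```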
4031 Finally, by bundling the pipeline in a shell script, other users can use
4032 your command, without having to remember the fancy plumbing you set up for
4033 them. In terms of how you run them, shell scripts and compiled programs are
4036 After the previous warm-up exercise, we'll look at two additional, more
4037 complicated pipelines. For them, we need to introduce two more tools.
4039 The first is the @code{tr} command, which stands for ``transliterate.''
4040 The @code{tr} command works on a character-by-character basis, changing
4041 characters. Normally it is used for things like mapping upper case to
4045 $ echo ThIs ExAmPlE HaS MIXED case! | tr '[A-Z]' '[a-z]'
4046 this example has mixed case!
4049 There are several options of interest:
4053 work on the complement of the listed characters, i.e.,
4054 operations apply to characters not in the given set
4057 delete characters in the first set from the output
4060 squeeze repeated characters in the output into just one character.
4063 We will be using all three options in a moment.
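
Before combining them, here is a short sketch of each option on its own.
The input strings are made up for illustration.

```shell
# -d: delete every character in the set (here, the vowels).
no_vowels=$(echo 'software tools' | tr -d 'aeiou')
echo "$no_vowels"

# -s: squeeze each run of a repeated character from the set into one.
squeezed=$(echo 'too    many   blanks' | tr -s ' ')
echo "$squeezed"

# -c with -d: delete everything NOT in the set, so only the digits
# and the newline (\012) survive.
digits=$(echo 'tel: 555-0100' | tr -cd '0-9\012')
echo "$digits"
```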
4065 The other command we'll look at is @code{comm}. The @code{comm}
4066 command takes two sorted input files as input data, and prints out the
4067 files' lines in three columns. The output columns are the data lines
4068 unique to the first file, the data lines unique to the second file, and
4069 the data lines that are common to both. The @samp{-1}, @samp{-2}, and
4070 @samp{-3} command line options omit the respective columns. (This is
4071 non-intuitive and takes a little getting used to.) For example:
4093 The single dash as a filename tells @code{comm} to read standard input
4094 instead of a regular file.
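
The following sketch exercises the column options on two tiny sorted
files. The file contents and the temporary directory are made up for
illustration.

```shell
# Two small sorted files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/first"
printf '%s\n' banana cherry date > "$dir/second"

# Omit columns 2 and 3: what remains is lines only in the first file.
only_first=$(comm -23 "$dir/first" "$dir/second")

# Omit columns 1 and 3: lines only in the second file.
only_second=$(comm -13 "$dir/first" "$dir/second")

# Omit columns 1 and 2: lines common to both files.
both=$(comm -12 "$dir/first" "$dir/second")

# A single dash makes comm read its first file from standard input.
from_stdin=$(printf '%s\n' apple zebra | comm -23 - "$dir/second")

rm -rf "$dir"
printf '%s\n' "$only_first" '--' "$only_second" '--' "$both" '--' "$from_stdin"
```

The trick is to think of the options as naming what to throw away, not
what to keep.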
4096 Now we're ready to build a fancy pipeline. The first application is a word
4097 frequency counter. This helps an author determine if he or she is over-using
4100 The first step is to change the case of all the letters in our input file
4101 to one case. ``The'' and ``the'' are the same word for counting purposes.
4104 $ tr '[A-Z]' '[a-z]' < whats.gnu | ...
4107 The next step is to get rid of punctuation. Quoted words and unquoted words
4108 should be treated identically; it's easiest to just get the punctuation out of
4112 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' | ...
4115 The second @code{tr} command operates on the complement of the listed
4116 characters, which are all the letters, the digits, the underscore, and
4117 the blank. The @samp{\012} represents the newline character; it has to
4118 be left alone. (The @sc{ascii} tab character should also be included for
4119 good measure in a production script.)
4121 At this point, we have data consisting of words separated by blank space.
4122 The words only contain alphanumeric characters (and the underscore). The
4123 next step is to break the data apart so that we have one word per line. This
4124 makes the counting operation much easier, as we will see shortly.
4127 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
4128 > tr -s '[ ]' '\012' | ...
4131 This command turns blanks into newlines. The @samp{-s} option squeezes
4132 multiple newline characters in the output into just one. This helps us
4133 avoid blank lines. (The @samp{>} is the shell's ``secondary prompt.''
4134 This is what the shell prints when it notices you haven't finished
4135 typing in all of a command.)
4137 We now have data consisting of one word per line, no punctuation, all one
4138 case. We're ready to count each word:
4141 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
4142 > tr -s '[ ]' '\012' | sort | uniq -c | ...
4145 At this point, the data might look something like this:
4158 The output is sorted by word, not by count! What we want is the most
4159 frequently used words first. Fortunately, this is easy to accomplish,
4160 with the help of two more @code{sort} options:
4164 do a numeric sort, not a textual one
4167 reverse the order of the sort
4170 The final pipeline looks like this:
4173 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
4174 > tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
4183 Whew! That's a lot to digest. Yet, the same principles apply. With six
4184 commands, on two lines (really one long one split for convenience), we've
4185 created a program that does something interesting and useful, in much
4186 less time than we could have written a C program to do the same thing.
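
If you don't have a suitable input file handy, here is a sketch of the same
pipeline run on two made-up sentences. One assumption to note: I have
written the character sets without the enclosing square brackets, which
@code{tr} does not need for ranges; with the brackets, the @samp{[} and
@samp{]} characters themselves would also be kept as part of the set.

```shell
# Made-up sample text, with mixed case and punctuation.
text='The quick fox and the lazy dog.
"The" fox saw the dog; the dog ran.'

freq=$(printf '%s\n' "$text" |
  tr 'A-Z' 'a-z' |             # fold everything to lower case
  tr -cd 'a-z0-9_ \012' |      # drop punctuation, keep blank and newline
  tr -s ' ' '\012' |           # one word per line, no empty lines
  sort | uniq -c |             # count each distinct word
  sort -nr)                    # most frequent first

# Show the three most frequent words with their counts.
printf '%s\n' "$freq" | head -n 3
```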
4188 A minor modification to the above pipeline can give us a simple spelling
4189 checker! To determine if you've spelled a word correctly, all you have to
4190 do is look it up in a dictionary. If it is not there, then chances are
4191 that your spelling is incorrect. So, we need a dictionary. If you
4192 have the Slackware Linux distribution, you have the file
4193 @file{/usr/lib/ispell/ispell.words}, which is a sorted, 38,400 word
4196 Now, how to compare our file with the dictionary? As before, we generate
4197 a sorted list of words, one per line:
4200 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
4201 > tr -s '[ ]' '\012' | sort -u | ...
4204 Now, all we need is a list of words that are @emph{not} in the
4205 dictionary. Here is where the @code{comm} command comes in.
4208 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
4209 > tr -s '[ ]' '\012' | sort -u |
4210 > comm -23 - /usr/lib/ispell/ispell.words
4213 The @samp{-2} and @samp{-3} options eliminate lines that are only in the
4214 dictionary (the second file), and lines that are in both files. Lines
4215 only in the first file (standard input, our stream of words) are
4216 words that are not in the dictionary. These are likely candidates for
4217 spelling errors. This pipeline was the first cut at a production
4218 spelling checker on Unix.
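
Here is a sketch of the spelling checker that does not depend on any
particular dictionary file. It builds a tiny, made-up word list in a
temporary file; on a real system you would name your own sorted dictionary
instead. As an assumption of mine, the @code{tr} character sets are written
without enclosing square brackets.

```shell
# A tiny stand-in dictionary; it must be sorted for comm to work.
dict=$(mktemp)
printf '%s\n' a brown dog fox lazy quick the > "$dict"

# Sample sentence with two deliberate typos: "quikc" and "dgo".
misspelled=$(echo 'The quikc brown fox; the lazy dgo.' |
  tr 'A-Z' 'a-z' | tr -cd 'a-z0-9_ \012' |
  tr -s ' ' '\012' | sort -u |
  comm -23 - "$dict")

rm -f "$dict"
printf '%s\n' "$misspelled"
```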
4220 There are some other tools that deserve brief mention.
4224 search files for text that matches a regular expression
4227 like @code{grep}, but with more powerful regular expressions
4230 count lines, words, characters
4233 a T-fitting for data pipes, copies data to files and to standard output
4236 the stream editor, an advanced tool
4239 a data manipulation language, another advanced tool
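
For a taste of these tools, here is a brief sketch that runs each one over
the same three made-up lines. It uses @samp{grep -E}, the standard spelling
of @code{egrep}'s extended regular expressions, and the @code{sed} and
@code{awk} one-liners are only the simplest possible examples of what those
tools can do.

```shell
lines='alpha 1
beta 22
gamma 333'

# grep: lines matching a regular expression.
b_lines=$(printf '%s\n' "$lines" | grep '^b')

# grep -E: extended regular expressions, here with alternation.
ag_lines=$(printf '%s\n' "$lines" | grep -E '^(alpha|gamma) ')

# tee: save a copy of the data in a file while it flows on to wc.
copy=$(mktemp)
count=$(printf '%s\n' "$lines" | tee "$copy" | wc -l)
saved=$(cat "$copy")
rm -f "$copy"

# sed: replace the first "a" on each line with "A".
edited=$(printf '%s\n' "$lines" | sed 's/a/A/')

# awk: add up the second field of every line.
total=$(printf '%s\n' "$lines" | awk '{ sum += $2 } END { print sum }')

printf '%s\n' "$b_lines" "$ag_lines" "$count" "$edited" "$total"
```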
4242 The software tools philosophy also espoused the following bit of
4243 advice: ``Let someone else do the hard part.'' This means, take
4244 something that gives you most of what you need, and then massage it the
4245 rest of the way until it's in the form that you want.
4251 Each program should do one thing well. No more, no less.
4254 Combining programs with appropriate plumbing leads to results where
4255 the whole is greater than the sum of the parts. It also leads to novel
4256 uses of programs that the authors might never have imagined.
4259 Programs should never print extraneous header or trailer data, since these
4260 could get sent on down a pipeline. (A point we didn't mention earlier.)
4263 Let someone else do the hard part.
4266 Know your toolbox! Use each program appropriately. If you don't have an
4267 appropriate tool, build one.
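
The point about header data is easy to see in a sketch. Here @code{report}
is a hypothetical program, written as a shell function, that prints a
heading line before its data. The heading is counted as if it were data,
and the next stage has to strip it off again.

```shell
# report is a made-up program that prints a header before its data.
report() {
  printf '%s\n' 'NAME' 'miriam' 'arnold'
}

# The header travels down the pipeline and is counted as a data line.
polluted=$(report | wc -l)
echo "$polluted"

# Every consumer has to remove it first; tail -n +2 starts output at
# the second line.
clean=$(report | tail -n +2 | sort)
printf '%s\n' "$clean"
```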
4270 As of this writing, all the programs we've discussed are available via
4271 anonymous @code{ftp} from @code{prep.ai.mit.edu} as
4272 @file{/pub/gnu/textutils-1.9.tar.gz}.@footnote{Version 1.9 was current
4273 when this column was written. Check the nearest GNU archive for the
4274 current version. The main GNU FTP site is now @code{ftp.gnu.org}.}
4276 None of what I have presented in this column is new. The Software Tools
4277 philosophy was first introduced in the book @cite{Software Tools},
4278 by Brian Kernighan and P.J. Plauger (Addison-Wesley, ISBN
4279 0-201-03669-X). This book showed how to write and use software
4280 tools. It was written in 1976, using a preprocessor for FORTRAN named
4281 @code{ratfor} (RATional FORtran). At the time, C was not as ubiquitous
4282 as it is now; FORTRAN was. The last chapter presented a @code{ratfor}
4283 to FORTRAN processor, written in @code{ratfor}. @code{ratfor} looks an
4284 awful lot like C; if you know C, you won't have any problem following
4287 In 1981, the book was updated and made available as @cite{Software
4288 Tools in Pascal} (Addison-Wesley, ISBN 0-201-10342-7). Both books
4289 remain in print, and are well worth reading if you're a programmer.
4290 They certainly made a major change in how I view programming.
4292 Initially, the programs in both books were available (on 9-track tape)
4293 from Addison-Wesley. Unfortunately, this is no longer the case,
4294 although you might be able to find copies floating around the Internet.
4295 For a number of years, there was an active Software Tools Users Group,
4296 whose members had ported the original @code{ratfor} programs to essentially
4297 every computer system with a FORTRAN compiler. The popularity of the
4298 group waned in the middle '80s as Unix began to spread beyond universities.
4300 With the current proliferation of GNU code and other clones of Unix programs,
4301 these programs now receive little attention; modern C versions are
4302 much more efficient and do more than these programs do. Nevertheless, as
4303 exposition of good programming style, and evangelism for a still-valuable
4304 philosophy, these books are unparalleled, and I recommend them highly.
4306 Acknowledgment: I would like to express my gratitude to Brian Kernighan
4307 of Bell Labs, the original Software Toolsmith, for reviewing this column.
4319 @c texinfo-column-for-description: 32