1 .\" $NetBSD: bzip2.1,v 1.9 2010/05/14 16:43:34 joerg Exp $
.Nd block-sorting file compressor
.Sh SYNOPSIS
.Nm bzip2
.Op Fl 123456789cdfkLqstVvz
.Sh DESCRIPTION
.Nm
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably better than that achieved by
more conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.
.Nm bzcat
decompresses files to stdout, and
.Nm bzip2recover
recovers data from damaged bzip2 files.
.Pp
The command-line options are deliberately very similar to those of GNU
.Xr gzip 1 ,
but they are not identical.
.Pp
.Nm
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of itself, with the name
.Dq Pa original_name.bz2 .
Each compressed file has the same modification date, permissions, and,
when possible, ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as
.Tn MS-DOS .
.Pp
.Nm
will by default not overwrite existing files.
If you want this to happen, specify the
.Fl f
flag.
.Pp
If no file names are specified,
.Nm
compresses from standard input to standard output.
In this case,
.Nm
will decline to write compressed output to a terminal, as this would
be entirely incomprehensible and therefore pointless.
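.Pp
For example, output from a pipeline can be compressed and redirected
to a file (the file names here are illustrative):
.Dl tar cf - /some/dir | bzip2 \*[Gt] dir.tar.bz2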
.Pp
.Nm bunzip2
(or
.Nm bzip2 Fl d )
decompresses all specified files.
Files which were not created by
.Nm
will be detected and ignored, and a warning issued.
.Nm
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:
.Bl -column "filename.tbz2" "becomes" -offset indent
.It Pa filename.bz2 Ta becomes Ta Pa filename
.It Pa filename.bz Ta becomes Ta Pa filename
.It Pa filename.tbz2 Ta becomes Ta Pa filename.tar
.It Pa filename.tbz Ta becomes Ta Pa filename.tar
.It Pa anyothername Ta becomes Ta Pa anyothername.out
.El
.Pp
If the file does not end in one of the recognised endings,
.Pa .bz2 ,
.Pa .bz ,
.Pa .tbz2
or
.Pa .tbz ,
.Nm
complains that it cannot guess the name of the original file, and uses
the original name with
.Pa .out
appended.
.Pp
As with compression, supplying no filenames causes decompression from
standard input to standard output.
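.Pp
For example (the file names are illustrative):
.Dl bunzip2 \*[Lt] myfile.bz2 \*[Gt] myfile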
.Pp
.Nm
will correctly decompress a file which is the concatenation of two or
more compressed files.
The result is the concatenation of the corresponding uncompressed
files.
Integrity testing
.Pq Fl t
of concatenated compressed files is also supported.
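.Pp
A sketch of this behaviour, with illustrative file names:
.Bd -literal -offset indent
bzip2 part1 part2             # produces part1.bz2 and part2.bz2
cat part1.bz2 part2.bz2 > whole.bz2
bzip2 -dc whole.bz2           # emits part1 followed by part2
.Ed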
.Pp
You can also compress or decompress files to the standard output by
giving the
.Fl c
flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates a stream
containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.Nm
version 0.9.0 or later.
Earlier versions of
.Nm
will stop after decompressing the first file in the stream.
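.Pp
For example, the following writes two compressed representations into
a single stream (again, the file names are illustrative):
.Bd -literal -offset indent
bzip2 -c file1 file2 > files.bz2   # one stream, two representations
bzip2 -dc files.bz2                # needs version 0.9.0 or later
.Ed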
.Pp
.Nm bzcat
(or
.Nm bzip2 Fl dc )
decompresses all specified files to the standard output.
.Pp
Compression is always performed, even if the compressed file is
slightly larger than the original.
Files of less than about one hundred bytes tend to get larger, since
the compression mechanism has a constant overhead in the region of 50
bytes.
Random data (including the output of most file compressors) is coded
at about 8.05 bits per byte, giving an expansion of around 0.5%.
.Pp
As a self-check for your protection,
.Nm
uses 32-bit CRCs to make sure that the decompressed version of a file
is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in
.Nm
(hopefully very unlikely).
The chances of data corruption going undetected are microscopic, about
one chance in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it can
only tell you that something is wrong.
It can't help you recover the original uncompressed data.
You can use
.Nm bzip2recover
to try to recover data from damaged files.
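.Pp
For example, the stored CRCs can be verified without writing any
output (the file name is illustrative):
.Dl bzip2 -tv archive.bz2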
.Bl -tag -width "XXrepetitiveXfastXX"
.It Fl Fl
Treats all subsequent arguments as file names, even if they start with
a dash.
This is so you can handle files with names beginning with a dash, for
example:
.Dl bzip2 -- -myfilename
.It Fl 1 , Fl Fl fast
to
.It Fl 9 , Fl Fl best
Set the block size to 100 k, 200 k ... 900 k when compressing.
Has no effect when decompressing.
See the
.Sx MEMORY MANAGEMENT
section below.
The
.Fl Fl fast
and
.Fl Fl best
aliases are primarily for GNU
.Xr gzip 1
compatibility.
In particular,
.Fl Fl fast
doesn't make things significantly faster, and
.Fl Fl best
merely selects the default behaviour.
.It Fl c , Fl Fl stdout
Compress or decompress to standard output.
.It Fl d , Fl Fl decompress
Force decompression.
.Nm bzip2 ,
.Nm bunzip2
and
.Nm bzcat
are really the same program, and the decision about what actions to
take is made on the basis of which name is used.
This flag overrides that mechanism, and forces
.Nm
to decompress.
.It Fl f , Fl Fl force
Force overwrite of output files.
Normally,
.Nm
will not overwrite existing output files.
Also forces
.Nm
to break hard links
to files, which it otherwise wouldn't do.
.Pp
.Nm
normally declines to decompress files which don't have the correct
magic header bytes.
If forced
.Pq Fl f ,
however, it will pass such files through unmodified.
This is how GNU
.Xr gzip 1
behaves.
.It Fl k , Fl Fl keep
Keep (don't delete) input files during compression
or decompression.
.It Fl L , Fl Fl license
Display the license terms and conditions.
.It Fl q , Fl Fl quiet
Suppress non-essential warning messages.
Messages pertaining to I/O errors and other critical events will not
be suppressed.
.It Fl Fl repetitive-fast
.It Fl Fl repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful.
0.9.5 and above have an improved algorithm which renders these flags
irrelevant.
.It Fl s , Fl Fl small
Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte.
This means any file can be decompressed in 2300k of memory, albeit at
about half the normal speed.
During compression,
.Fl s
selects a block size of 200k, which limits memory use to around the
same figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8 megabytes or less), use
.Fl s
for everything.
See the
.Sx MEMORY MANAGEMENT
section below.
.It Fl t , Fl Fl test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.It Fl V , Fl Fl version
Display the software version.
.It Fl v , Fl Fl verbose
Verbose mode: show the compression ratio for each file processed.
Further
.Fl v Ns 's
increase the verbosity level, spewing out lots of information which is
primarily of interest for diagnostic purposes.
.It Fl z , Fl Fl compress
The complement to
.Fl d :
forces compression, regardless of the invocation name.
.El
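.Pp
For example, the flags above combine in the usual way; the following
compresses with the largest block size, keeps the original file, and
reports the compression ratio (the file name is illustrative):
.Dl bzip2 -9 -v -k logfile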
.Ss MEMORY MANAGEMENT
.Nm
compresses large files in blocks.
The block size affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags
.Fl 1
through
.Fl 9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.
At decompression time, the block size used for compression is read
from the header of the compressed file, and
.Nm bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the
flags
.Fl 1
to
.Fl 9
are irrelevant to, and so ignored during, decompression.
.Pp
Compression and decompression requirements, in bytes, can be estimated
as:
.Bl -tag -width "Decompression:" -offset indent
.It Compression:
400k + ( 8 x block size )
.It Decompression:
100k + ( 4 x block size ), or 100k + ( 2.5 x block size )
.El
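.Pp
As a worked example, with the
.Fl 5
flag (500,000 byte blocks) compression needs roughly 400k + 8 x 500k =
4400k, and decompression roughly 100k + 4 x 500k = 2100k, or 100k +
2.5 x 500k = 1350k with
.Fl s ;
these figures match the table below.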
.Pp
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using
.Nm
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
.Pp
For files compressed with the default 900k block size,
.Nm bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.Nm bunzip2
has an option to decompress using approximately half this amount of
memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this option only
where necessary; the relevant flag is
.Fl s .
.Pp
In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by block
size.
.Pp
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.
The amount of real memory touched is proportional to the size of the
file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
.Fl 9
will cause the compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it.
Similarly, the decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.
.Pp
Here is a table which summarises the maximum memory usage for different
block sizes.
Also recorded is the total compressed size for 14 files of the Calgary
Text Compression Corpus totalling 3,141,622 bytes.
This column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
.Bl -column "Flag" "Compression" "Decompression" "DecompressionXXs" "Corpus size"
.It Sy Flag Ta Sy Compression Ta Sy Decompression Ta Sy Decompression Fl s Ta Sy Corpus size
.It -1 Ta 1200k Ta 500k Ta 350k Ta 914704
.It -2 Ta 2000k Ta 900k Ta 600k Ta 877703
.It -3 Ta 2800k Ta 1300k Ta 850k Ta 860338
.It -4 Ta 3600k Ta 1700k Ta 1100k Ta 846899
.It -5 Ta 4400k Ta 2100k Ta 1350k Ta 845160
.It -6 Ta 5200k Ta 2500k Ta 1600k Ta 838626
.It -7 Ta 6100k Ta 2900k Ta 1850k Ta 834096
.It -8 Ta 6800k Ta 3300k Ta 2100k Ta 828642
.It -9 Ta 7600k Ta 3700k Ta 2350k Ta 828642
.El
.Ss RECOVERING DATA FROM DAMAGED FILES
.Nm
compresses files in blocks, usually 900 kbytes long.
Each block is handled independently.
If a media or transmission error causes a multi-block
.Pa .bz2
file to become damaged, it may be possible to recover data from the
undamaged blocks in the file.
.Pp
The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.
Each block also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
.Pp
.Nm bzip2recover
is a simple program whose purpose is to search for blocks in
.Pa .bz2
files, and write each block out into its own
.Pa .bz2
file.
You can then use
.Nm bzip2 Fl t
to test the integrity of the resulting files, and decompress those
which are undamaged.
.Pp
.Nm bzip2recover
takes a single argument, the name of the damaged file, and writes a
number of files
.Dq Pa rec00001file.bz2 ,
.Dq Pa rec00002file.bz2 ,
etc., containing the extracted blocks.
The output filenames are designed so that the use of wildcards in
subsequent processing -- for example,
.Dl bzip2 -dc rec*file.bz2 \*[Gt] recovered_data
-- processes the files in the correct order.
.Pp
.Nm bzip2recover
should be of most use dealing with large
.Pa .bz2
files, as these will contain many blocks.
It is clearly futile to use it on damaged single-block files, since a
damaged block cannot be recovered.
If you wish to minimise any potential data loss through media or
transmission errors, you might consider compressing with a smaller
block size.
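.Pp
For instance, data destined for an unreliable channel might be
compressed with the smallest block size (the file name is
illustrative):
.Dl bzip2 -1 precious_data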
.Ss PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file.
Because of this, files containing very long runs of repeated
symbols, like
.Dq aabaabaabaab ...
(repeated several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions in
this respect.
The ratio between worst-case and average-case compression time is in
the region of 10:1.
For previous versions, this figure was more like 100:1.
You can use the
.Fl vvvv
option to monitor progress in great detail, if you want.
.Pp
Decompression speed is unaffected by these phenomena.
.Pp
.Nm
usually allocates several megabytes of memory to operate in, and then
charges all over it in a fairly random fashion.
This means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can service
cache misses.
Because of this, small changes to the code to reduce the miss rate
have been observed to give disproportionately large performance
improvements.
I imagine
.Nm
will perform best on machines with very large caches.
.Sh ENVIRONMENT
.Nm
will read arguments from the environment variables
.Ev BZIP2
and
.Ev BZIP ,
in that order, and will process them before any arguments read from
the command line.
This gives a convenient way to supply default arguments.
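.Pp
For example, a default of
.Fl s
could be supplied like this from a Bourne-style shell (the file name
is illustrative):
.Dl BZIP2=-s bzip2 myfile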
.Sh EXIT STATUS
0 for a normal exit, 1 for environmental problems (file not found,
invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed
file, 3 for an internal consistency error (e.g., bug) which caused
.Nm
to panic.
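.Pp
A script might act on these values as follows (a sketch; the file
name is illustrative):
.Bd -literal -offset indent
bzip2 -t backup.bz2
case $? in
0) echo "ok" ;;
2) echo "corrupt compressed file" ;;
*) echo "other failure" ;;
esac
.Ed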
.Sh AUTHORS
.An Julian Seward
.Pa http://www.bzip.org
.Pp
The ideas embodied in
.Nm
are due to (at least) the following people:
.An Michael Burrows
and
.An David Wheeler
(for the block sorting transformation),
.An David Wheeler
(again, for the Huffman coder),
.An Peter Fenwick
(for the structured coding model in the original
.Nm bzip ,
and many refinements), and
.An Alistair Moffat ,
.An Radford Neal
and
.An Ian Witten
(for the arithmetic coder in the original
.Nm bzip ) .
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to sources of
documentation.
Christian von Roques encouraged me to look for faster sorting
algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
.Sh BUGS
I/O error messages are not as helpful as they could be.
.Nm
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
.Pp
This manual page pertains to version 1.0.5 of
.Nm .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
.Pp
.Nm
versions prior to 1.0.2 used 32-bit integers to represent bit
positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.
Versions 1.0.2 and above use 64-bit ints on some platforms which
support them (GNU supported targets, and Windows).
To establish whether or not
.Nm
was built with such a limitation, run it without arguments.
In any event, you can build an unlimited version by recompiling with
MaybeUInt64 set to an unsigned 64-bit integer.