.PU
.TH bzip2 1
.SH NAME
bzip2, bunzip2 \- a block-sorting file compressor, v1.0.4
.br
bzcat \- decompresses files to stdout
.br
bzip2recover \- recovers data from damaged bzip2 files

.SH SYNOPSIS
.ll +8
.B bzip2
.RB [ " \-cdfkqstvzVL123456789 " ]
[
.I "filenames \&..."
]
.br
.B bzip2
.RB [ " \-h|--help " ]
.ll -8
.br
.B bunzip2
.RB [ " \-fkvsVL " ]
[
.I "filenames \&..."
]
.br
.B bunzip2
.RB [ " \-h|--help " ]
.br
.B bzcat
.RB [ " \-s " ]
[
.I "filenames \&..."
]
.br
.B bzcat
.RB [ " \-h|--help " ]
.br
.B bzip2recover
.I "filename"

.SH DESCRIPTION
.I bzip2
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.  Compression is
generally considerably better than that achieved by more conventional
LZ77/LZ78-based compressors, and approaches the performance of the PPM
family of statistical compressors.

The command-line options are deliberately very similar to
those of
.I GNU gzip,
but they are not identical.

.I bzip2
expects a list of file names to accompany the
command-line flags.  Each file is replaced by a compressed version of
itself, with the name "original_name.bz2".
Each compressed file
has the same modification date, permissions, and, when possible,
ownership as the corresponding original, so that these properties can
be correctly restored at decompression time.  File name handling is
naive in the sense that there is no mechanism for preserving original
file names, permissions, ownerships or dates in filesystems which lack
these concepts, or have serious file name length restrictions, such as
MS-DOS.

.I bzip2
and
.I bunzip2
will by default not overwrite existing
files.  If you want this to happen, specify the \-f flag.

If no file names are specified,
.I bzip2
compresses from standard
input to standard output.  In this case,
.I bzip2
will decline to
write compressed output to a terminal, as this would be entirely
incomprehensible and therefore pointless.

.I bunzip2
(or
.I bzip2 \-d)
decompresses all
specified files.  Files which were not created by
.I bzip2
will be detected and ignored, and a warning issued.
.I bzip2
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:

       filename.bz2    becomes   filename
       filename.bz     becomes   filename
       filename.tbz2   becomes   filename.tar
       filename.tbz    becomes   filename.tar
       anyothername    becomes   anyothername.out

If the file does not end in one of the recognised endings,
.I .bz2,
.I .bz,
.I .tbz2
or
.I .tbz,
.I bzip2
complains that it cannot
guess the name of the original file, and uses the original name
with
.I .out
appended.
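.PP
The suffix rules above can be sketched as a small lookup.  This is an
illustrative Python helper, not code taken from bzip2; the names
SUFFIX_MAP and guess_output_name are hypothetical.
.nf
```python
# Hypothetical helper mirroring the suffix table above; not bzip2's own code.
SUFFIX_MAP = [
    (".bz2", ""),
    (".bz", ""),
    (".tbz2", ".tar"),
    (".tbz", ".tar"),
]


def guess_output_name(name):
    """Return the decompressed file name implied by the suffix rules."""
    for suffix, replacement in SUFFIX_MAP:
        if name.endswith(suffix):
            return name[: len(name) - len(suffix)] + replacement
    # Unrecognised ending: keep the whole name and append ".out".
    return name + ".out"


print(guess_output_name("backup.tbz2"))
```
.fi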

As with compression, supplying no
filenames causes decompression from
standard input to standard output.

.I bunzip2
will correctly decompress a file which is the
concatenation of two or more compressed files.  The result is the
concatenation of the corresponding uncompressed files.  Integrity
testing (\-t)
of concatenated
compressed files is also supported.
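.PP
The concatenation behaviour can be tried without the command-line tool.
The sketch below uses the bz2 module from the Python standard library,
which is assumed here to follow the same multi-stream rule (Python 3.3
or later); it is not part of bzip2 itself.
.nf
```python
import bz2

# Two independent compressed streams, as if from two separate bzip2 runs.
first = bz2.compress(b"alpha ")
second = bz2.compress(b"beta")

# Joining the compressed bytes gives a valid multi-stream file.
combined = first + second

# Decompressing the result yields the concatenation of the originals.
restored = bz2.decompress(combined)
print(restored)
```
.fi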

You can also compress or decompress files to the standard output by
giving the \-c flag.  Multiple files may be compressed and
decompressed like this.  The resulting outputs are fed sequentially to
stdout.  Compression of multiple files
in this manner generates a stream
containing multiple compressed file representations.  Such a stream
can be decompressed correctly only by
.I bzip2
version 0.9.0 or
later.  Earlier versions of
.I bzip2
will stop after decompressing
the first file in the stream.

.I bzcat
(or
.I bzip2 -dc)
decompresses all specified files to
the standard output.

.I bzip2
will read arguments from the environment variables
.I BZIP2
and
.I BZIP,
in that order, and will process them
before any arguments read from the command line.  This gives a
convenient way to supply default arguments.

Compression is always performed, even if the compressed
file is slightly
larger than the original.  Files of less than about one hundred bytes
tend to get larger, since the compression mechanism has a constant
overhead in the region of 50 bytes.  Random data (including the output
of most file compressors) is coded at about 8.05 bits per byte, giving
an expansion of around 0.5%.
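.PP
A rough way to observe this expansion is to compare input and output
sizes.  The sketch below uses the Python standard library bz2 module as
a stand-in for the command-line tool; the helper name overhead is
hypothetical, and the exact byte counts will vary from run to run.
.nf
```python
import bz2
import os


def overhead(data):
    """Bytes added (positive) or saved (negative) by compressing data."""
    return len(bz2.compress(data)) - len(data)


tiny = b"hello"                     # far below one hundred bytes
random_block = os.urandom(100_000)  # incompressible input
repetitive = b"abc" * 10_000        # highly compressible input

for label, data in (("tiny", tiny),
                    ("random", random_block),
                    ("repetitive", repetitive)):
    print(label, overhead(data))
```
.fi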

As a self-check for your protection,
.I bzip2
uses 32-bit CRCs to
make sure that the decompressed version of a file is identical to the
original.  This guards against corruption of the compressed data, and
against undetected bugs in
.I bzip2
(hopefully very unlikely).  The
chance of data corruption going undetected is microscopic, about one
chance in four billion for each file processed.  Be aware, though, that
the check occurs upon decompression, so it can only tell you that
something is wrong.  It can't help you
recover the original uncompressed
data.  You can use
.I bzip2recover
to try to recover data from
damaged files.

Return values: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc.), 2 to indicate a corrupt
compressed file, 3 for an internal consistency error (e.g., a bug) which
caused
.I bzip2
to panic.

.SH OPTIONS
.TP
.B \-c --stdout
Compress or decompress to standard output.
.TP
.B \-d --decompress
Force decompression.
.I bzip2,
.I bunzip2
and
.I bzcat
are
really the same program, and the decision about what actions to take is
made on the basis of which name is used.  This flag overrides that
mechanism, and forces
.I bzip2
to decompress.
.TP
.B \-z --compress
The complement to \-d: forces compression, regardless of the
invocation name.
.TP
.B \-t --test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.TP
.B \-f --force
Force overwrite of output files.  Normally,
.I bzip2
will not overwrite
existing output files.  Also forces
.I bzip2
to break hard links
to files, which it otherwise wouldn't do.

bzip2 normally declines to decompress files which don't have the
correct magic header bytes.  If forced (-f), however, it will pass
such files through unmodified.  This is how GNU gzip behaves.
.TP
.B \-k --keep
Keep (don't delete) input files during compression
or decompression.
.TP
.B \-s --small
Reduce memory usage, for compression, decompression and testing.  Files
are decompressed and tested using a modified algorithm which only
requires 2.5 bytes per block byte.  This means any file can be
decompressed in 2300k of memory, albeit at about half the normal speed.

During compression, \-s selects a block size of 200k, which limits
memory use to around the same figure, at the expense of your compression
ratio.  In short, if your machine is low on memory (8 megabytes or
less), use \-s for everything.  See MEMORY MANAGEMENT below.
.TP
.B \-q --quiet
Suppress non-essential warning messages.  Messages pertaining to
I/O errors and other critical events will not be suppressed.
.TP
.B \-p --show-progress
Show the percentage of the input file processed so far and, when
compressing, the size of the new file as a percentage of the original.
.TP
.B \-v --verbose
Verbose mode -- show the compression ratio for each file processed.
Further \-v's increase the verbosity level, spewing out lots of
information which is primarily of interest for diagnostic purposes.
.TP
.B \-h --help
Print a help message and exit.
.TP
.B \-L --license -V --version
Display the software version, license terms and conditions.
.TP
.B \-1 (or \-\-fast) to \-9 (or \-\-best)
Set the block size to 100 k, 200 k ... 900 k when compressing.  Has no
effect when decompressing.  See MEMORY MANAGEMENT below.
The \-\-fast and \-\-best aliases are primarily for GNU gzip
compatibility.  In particular, \-\-fast doesn't make things
significantly faster.
And \-\-best merely selects the default behaviour.
.TP
.B \--
Treats all subsequent arguments as file names, even if they start
with a dash.  This is so you can handle files with names beginning
with a dash, for example: bzip2 \-- \-myfilename.
.TP
.B \--repetitive-fast --repetitive-best
These flags are redundant in versions 0.9.5 and above.  They provided
some coarse control over the behaviour of the sorting algorithm in
earlier versions, which was sometimes useful.  0.9.5 and above have an
improved algorithm which renders these flags irrelevant.

.SH MEMORY MANAGEMENT
.I bzip2
compresses large files in blocks.  The block size affects
both the compression ratio achieved, and the amount of memory needed for
compression and decompression.  The flags \-1 through \-9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.  At decompression time, the block size used for
compression is read from the header of the compressed file, and
.I bunzip2
then allocates itself just enough memory to decompress
the file.  Since block sizes are stored in compressed files, it follows
that the flags \-1 to \-9 are irrelevant to, and so ignored
during, decompression.
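.PP
As an illustration of how the block size travels with the file, the
sketch below reads it back from a stream header.  It assumes the header
layout commonly documented for the format (the three bytes "BZh"
followed by an ASCII digit 1 to 9), which this page does not spell out,
and uses the Python standard library bz2 module to produce test data.
The function name block_size_from_header is hypothetical.
.nf
```python
import bz2


def block_size_from_header(stream):
    """Return the block size in bytes recorded in a bzip2 stream header."""
    # Assumed layout: b"BZh" followed by one ASCII digit, '1' to '9'.
    if stream[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    level = stream[3] - ord("0")
    if not 1 <= level <= 9:
        raise ValueError("unexpected block size digit")
    return level * 100_000


for level in (1, 5, 9):
    compressed = bz2.compress(b"example data", level)
    print(level, block_size_from_header(compressed))
```
.fi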

Compression and decompression requirements,
in bytes, can be estimated as:

       Compression:   400k + ( 8 x block size )

       Decompression: 100k + ( 4 x block size ), or
                      100k + ( 2.5 x block size )
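.PP
The estimates above are easy to evaluate for a given flag.  The sketch
below is a direct transcription of the three formulas, assuming that
"k" means 1000 bytes, so that \-1 to \-9 give block sizes of 100,000
to 900,000 bytes.  The function names are hypothetical.
.nf
```python
K = 1000  # assumption: "k" in the formulas means 1000 bytes


def compress_memory(level):
    """Estimated bytes needed to compress with flag -1 .. -9."""
    return 400 * K + 8 * level * 100 * K


def decompress_memory(level, small=False):
    """Estimated bytes needed to decompress; small=True models -s."""
    factor = 2.5 if small else 4
    return int(100 * K + factor * level * 100 * K)


for level in (1, 9):
    print(level,
          compress_memory(level) // K,
          decompress_memory(level) // K,
          decompress_memory(level, small=True) // K)
```
.fi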

Larger block sizes give rapidly diminishing marginal returns.  Most of
the compression comes from the first two or three hundred k of block
size, a fact worth bearing in mind when using
.I bzip2
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size,
.I bunzip2
will require about 3700 kbytes to decompress.  To support decompression
of any file on a 4 megabyte machine,
.I bunzip2
has an option to
decompress using approximately half this amount of memory, about 2300
kbytes.  Decompression speed is also halved, so you should use this
option only where necessary.  The relevant flag is -s.

In general, try to use the largest block size memory constraints allow,
since that maximises the compression achieved.  Compression and
decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.  The
amount of real memory touched is proportional to the size of the file,
since the file is smaller than a block.  For example, compressing a file
20,000 bytes long with the flag -9 will cause the compressor to
allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560
kbytes of it.  Similarly, the decompressor will allocate 3700k but only
touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different
block sizes.  Also recorded is the total compressed size for 14 files of
the Calgary Text Compression Corpus totalling 3,141,622 bytes.  This
column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes for
larger files, since the Corpus is dominated by smaller files.

           Compress   Decompress   Decompress   Corpus
    Flag     usage      usage       -s usage     Size

     -1      1200k       500k         350k      914704
     -2      2000k       900k         600k      877703
     -3      2800k      1300k         850k      860338
     -4      3600k      1700k        1100k      846899
     -5      4400k      2100k        1350k      845160
     -6      5200k      2500k        1600k      838626
     -7      6000k      2900k        1850k      834096
     -8      6800k      3300k        2100k      828642
     -9      7600k      3700k        2350k      828642

.SH RECOVERING DATA FROM DAMAGED FILES
.I bzip2
compresses files in blocks, usually 900 kbytes long.  Each
block is handled independently.  If a media or transmission error causes
a multi-block .bz2
file to become damaged, it may be possible to
recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.  Each block also carries its own 32-bit CRC, so
damaged blocks can be distinguished from undamaged ones.
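.PP
The delimiter can be seen at the start of the first block of any
non-empty compressed stream.  The sketch below assumes the value
commonly documented for the format, hexadecimal 314159265359, which
this page does not state, and assumes a 4-byte stream header in front
of it.  Only the first block is byte-aligned, so this check does not
find later blocks; a real scanner would have to search bit by bit.
Test data comes from the Python standard library bz2 module.
.nf
```python
import bz2

# Assumed 48-bit block delimiter; taken from common format descriptions.
BLOCK_MAGIC = bytes.fromhex("314159265359")

stream = bz2.compress(b"some data to compress")

# Assumed layout: 4-byte stream header, then the first block delimiter.
header = stream[:4]
first_marker = stream[4:10]

print(header, first_marker.hex(), first_marker == BLOCK_MAGIC)
```
.fi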

.I bzip2recover
is a simple program whose purpose is to search for
blocks in .bz2 files, and write each block out into its own .bz2
file.  You can then use
.I bzip2
\-t
to test the
integrity of the resulting files, and decompress those which are
undamaged.

.I bzip2recover
takes a single argument, the name of the damaged file,
and writes a number of files "rec00001file.bz2",
"rec00002file.bz2", etc., containing the extracted blocks.
The output filenames are designed so that the use of
wildcards in subsequent processing -- for example,
"bzip2 -dc rec*file.bz2 > recovered_data" -- processes the files in
the correct order.
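.PP
The ordering works because the block number is zero-padded to a fixed
width, so a plain lexicographic sort (which is what wildcard expansion
normally gives) matches numeric order.  The sketch below imitates the
naming pattern shown above for illustration only; recovered_name is a
hypothetical helper, not part of bzip2recover.
.nf
```python
def recovered_name(index, original):
    """Build a name in the rec00001file.bz2 pattern (illustrative only)."""
    return "rec%05d%s" % (index, original)


# Deliberately out of order, including a two-digit and three-digit index.
names = [recovered_name(i, "file.bz2") for i in (10, 2, 1, 100)]

# A plain string sort restores block order thanks to the zero padding.
ordered = sorted(names)
print(ordered)
```
.fi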

.I bzip2recover
should be of most use dealing with large .bz2
files, as these will contain many blocks.  It is clearly
futile to use it on damaged single-block files, since a
damaged block cannot be recovered.  If you wish to minimise
any potential data loss through media or transmission errors,
you might consider compressing with a smaller
block size.

.SH PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in the
file.  Because of this, files containing very long runs of repeated
symbols, like "aabaabaabaab ..."  (repeated several hundred times) may
compress more slowly than normal.  Versions 0.9.5 and above fare much
better than previous versions in this respect.  The ratio between
worst-case and average-case compression time is in the region of 10:1.
For previous versions, this figure was more like 100:1.  You can use the
\-vvvv option to monitor progress in great detail, if you want.

Decompression speed is unaffected by these phenomena.

.I bzip2
usually allocates several megabytes of memory to operate
in, and then charges all over it in a fairly random fashion.  This means
that performance, both for compressing and decompressing, is largely
determined by the speed at which your machine can service cache misses.
Because of this, small changes to the code to reduce the miss rate have
been observed to give disproportionately large performance improvements.
I imagine
.I bzip2
will perform best on machines with very large caches.

.SH CAVEATS
I/O error messages are not as helpful as they could be.
.I bzip2
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.

This manual page pertains to version 1.0.4 of
.I bzip2.
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the following
exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.  0.1pl2 cannot do this; it will stop
after decompressing just the first file in the stream.

.I bzip2recover
versions prior to 1.0.2 used 32-bit integers to represent
bit positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.  Versions 1.0.2 and above use
64-bit ints on some platforms which support them (GNU supported
targets, and Windows).  To establish whether or not bzip2recover was
built with such a limitation, run it without arguments.  In any event
you can build yourself an unlimited version if you can recompile it
with MaybeUInt64 set to be an unsigned 64-bit integer.

.SH AUTHOR
Julian Seward, jseward@bzip.org.

http://www.bzip.org

The ideas embodied in
.I bzip2
are due to (at least) the following
people: Michael Burrows and David Wheeler (for the block sorting
transformation), David Wheeler (again, for the Huffman coder), Peter
Fenwick (for the structured coding model in the original
.I bzip,
and many refinements), and Alistair Moffat, Radford Neal and Ian Witten
(for the arithmetic coder in the original
.I bzip).
I am much
indebted for their help, support and advice.  See the manual in the
source distribution for pointers to sources of documentation.  Christian
von Roques encouraged me to look for faster sorting algorithms, so as to
speed up compression.  Bela Lubkin encouraged me to improve the
worst-case compression performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped
with portability problems, lent machines, gave advice and were generally
helpful.