1 .\" $NetBSD: bzip2.1,v 1.9 2010/05/14 16:43:34 joerg Exp $
.Nd block-sorting file compressor
.Sh SYNOPSIS
.Nm bzip2
.Op Fl 123456789cdfkLqstVvz
.Sh DESCRIPTION
.Nm
compresses files using the Burrows-Wheeler block sorting
text compression algorithm, and Huffman coding.
Compression is generally considerably better than that achieved by
more conventional LZ77/LZ78-based compressors, and approaches the
performance of the PPM family of statistical compressors.
.Nm bzcat
decompresses files to stdout, and
.Nm bzip2recover
recovers data from damaged bzip2 files.
.Pp
The command-line options are deliberately very similar to those of GNU
.Xr gzip 1 ,
but they are not identical.
.Pp
.Nm
expects a list of file names to accompany the command-line flags.
Each file is replaced by a compressed version of itself, with the name
.Dq Pa original_name.bz2 .
Each compressed file has the same modification date, permissions, and,
when possible, ownership as the corresponding original, so that these
properties can be correctly restored at decompression time.
File name handling is naive in the sense that there is no mechanism
for preserving original file names, permissions, ownerships or dates
in filesystems which lack these concepts, or have serious file name
length restrictions, such as
.Tn MS-DOS .
.Pp
.Nm
will by default not overwrite existing files.
If you want this to happen, specify the
.Fl f
flag.
.Pp
If no file names are specified,
.Nm
compresses from standard input to standard output.
In this case,
.Nm
will decline to write compressed output to a terminal, as this would
be entirely incomprehensible and therefore pointless.
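.Pp
For example, output from a pipeline can be compressed and redirected
to a file (the file names here are illustrative):
.Dl tar cf - /some/dir | bzip2 \*[Gt] dir.tar.bz2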
.Pp
.Nm bunzip2
(or
.Nm bzip2 Fl d )
decompresses all specified files.
Files which were not created by
.Nm
will be detected and ignored, and a warning issued.
.Nm
attempts to guess the filename for the decompressed file
from that of the compressed file as follows:
.Bl -column "filename.tbz2" "becomes" -offset indent
.It Pa filename.bz2 Ta becomes Ta Pa filename
.It Pa filename.bz Ta becomes Ta Pa filename
.It Pa filename.tbz2 Ta becomes Ta Pa filename.tar
.It Pa filename.tbz Ta becomes Ta Pa filename.tar
.It Pa anyothername Ta becomes Ta Pa anyothername.out
.El
.Pp
If the file does not end in one of the recognised endings,
.Pa .bz2 ,
.Pa .bz ,
.Pa .tbz2
or
.Pa .tbz ,
.Nm
complains that it cannot guess the name of the original file, and uses
the original name with
.Pa .out
appended.
.Pp
As with compression, supplying no filenames causes decompression from
standard input to standard output.
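.Pp
For example (the file names are illustrative):
.Dl bunzip2 \*[Lt] myfile.bz2 \*[Gt] myfile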
.Pp
.Nm
will correctly decompress a file which is the concatenation of two or
more compressed files.
The result is the concatenation of the corresponding uncompressed
files.
Integrity testing
.Pq Fl t
of concatenated compressed files is also supported.
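.Pp
A sketch of this behaviour, with illustrative file names:
.Bd -literal -offset indent
bzip2 part1 part2             # produces part1.bz2 and part2.bz2
cat part1.bz2 part2.bz2 > whole.bz2
bzip2 -dc whole.bz2           # emits part1 followed by part2
.Ed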
.Pp
You can also compress or decompress files to the standard output by
giving the
.Fl c
flag.
Multiple files may be compressed and decompressed like this.
The resulting outputs are fed sequentially to stdout.
Compression of multiple files in this manner generates a stream
containing multiple compressed file representations.
Such a stream can be decompressed correctly only by
.Nm
version 0.9.0 or later.
Earlier versions of
.Nm
will stop after decompressing the first file in the stream.
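.Pp
For example, the following writes two compressed representations into
a single stream (again, the file names are illustrative):
.Bd -literal -offset indent
bzip2 -c file1 file2 > files.bz2   # one stream, two representations
bzip2 -dc files.bz2                # needs version 0.9.0 or later
.Ed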
.Pp
.Nm bzcat
(or
.Nm bzip2 Fl dc )
decompresses all specified files to the standard output.
.Pp
Compression is always performed, even if the compressed file is
slightly larger than the original.
Files of less than about one hundred bytes tend to get larger, since
the compression mechanism has a constant overhead in the region of 50
bytes.
Random data (including the output of most file compressors) is coded
at about 8.05 bits per byte, giving an expansion of around 0.5%.
.Pp
As a self-check for your protection,
.Nm
uses 32-bit CRCs to make sure that the decompressed version of a file
is identical to the original.
This guards against corruption of the compressed data, and against
undetected bugs in
.Nm
(hopefully very unlikely).
The chances of data corruption going undetected are microscopic, about
one chance in four billion for each file processed.
Be aware, though, that the check occurs upon decompression, so it can
only tell you that something is wrong.
It can't help you recover the original uncompressed data.
You can use
.Nm bzip2recover
to try to recover data from damaged files.
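.Pp
For example, the stored CRCs can be verified without writing any
output (the file name is illustrative):
.Dl bzip2 -tv archive.bz2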
.Bl -tag -width "XXrepetitiveXfastXX"
.It Fl Fl
Treats all subsequent arguments as file names, even if they start with
a dash.
This is so you can handle files with names beginning with a dash, for
example:
.Dl bzip2 -- -myfilename
.It Fl 1 , Fl Fl fast
to
.It Fl 9 , Fl Fl best
Set the block size to 100 k, 200 k ... 900 k when compressing.
Has no effect when decompressing.
See the
.Sx MEMORY MANAGEMENT
section below.
The
.Fl Fl fast
and
.Fl Fl best
aliases are primarily for GNU
.Xr gzip 1
compatibility.
In particular,
.Fl Fl fast
doesn't make things significantly faster, and
.Fl Fl best
merely selects the default behaviour.
.It Fl c , Fl Fl stdout
Compress or decompress to standard output.
.It Fl d , Fl Fl decompress
Force decompression.
.Nm bzip2 ,
.Nm bunzip2
and
.Nm bzcat
are really the same program, and the decision about what actions to
take is made on the basis of which name is used.
This flag overrides that mechanism, and forces
.Nm
to decompress.
.It Fl f , Fl Fl force
Force overwrite of output files.
Normally,
.Nm
will not overwrite existing output files.
Also forces
.Nm
to break hard links
to files, which it otherwise wouldn't do.
.Pp
.Nm
normally declines to decompress files which don't have the correct
magic header bytes.
If forced
.Pq Fl f ,
however, it will pass such files through unmodified.
This is how GNU
.Xr gzip 1
behaves.
.It Fl k , Fl Fl keep
Keep (don't delete) input files during compression
or decompression.
.It Fl L , Fl Fl license
Display the license terms and conditions.
.It Fl q , Fl Fl quiet
Suppress non-essential warning messages.
Messages pertaining to I/O errors and other critical events will not
be suppressed.
.It Fl Fl repetitive-fast
.It Fl Fl repetitive-best
These flags are redundant in versions 0.9.5 and above.
They provided some coarse control over the behaviour of the sorting
algorithm in earlier versions, which was sometimes useful.
0.9.5 and above have an improved algorithm which renders these flags
irrelevant.
.It Fl s , Fl Fl small
Reduce memory usage, for compression, decompression and testing.
Files are decompressed and tested using a modified algorithm which
only requires 2.5 bytes per block byte.
This means any file can be decompressed in 2300k of memory, albeit at
about half the normal speed.
During compression,
.Fl s
selects a block size of 200k, which limits memory use to around the
same figure, at the expense of your compression ratio.
In short, if your machine is low on memory (8 megabytes or less), use
.Fl s
for everything.
See the
.Sx MEMORY MANAGEMENT
section below.
.It Fl t , Fl Fl test
Check integrity of the specified file(s), but don't decompress them.
This really performs a trial decompression and throws away the result.
.It Fl V , Fl Fl version
Display the software version.
.It Fl v , Fl Fl verbose
Verbose mode: show the compression ratio for each file processed.
Further
.Fl v Ns 's
increase the verbosity level, spewing out lots of information which is
primarily of interest for diagnostic purposes.
.It Fl z , Fl Fl compress
The complement to
.Fl d :
forces compression, regardless of the invocation name.
.El
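.Pp
For example, the flags above combine in the usual way; the following
compresses with the largest block size, keeps the original file, and
reports the compression ratio (the file name is illustrative):
.Dl bzip2 -9 -v -k logfile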
.Ss MEMORY MANAGEMENT
.Nm
compresses large files in blocks.
The block size affects both the compression ratio achieved, and the
amount of memory needed for compression and decompression.
The flags
.Fl 1
through
.Fl 9
specify the block size to be 100,000 bytes through 900,000 bytes (the
default) respectively.
At decompression time, the block size used for compression is read
from the header of the compressed file, and
.Nm bunzip2
then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the
flags
.Fl 1
to
.Fl 9
are irrelevant to, and so ignored during, decompression.
.Pp
Compression and decompression requirements, in bytes, can be estimated
as:
.Bl -tag -width "Decompression:" -offset indent
.It Compression:
400k + ( 8 x block size )
.It Decompression:
100k + ( 4 x block size ), or 100k + ( 2.5 x block size )
.El
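.Pp
As a worked example, with the
.Fl 5
flag (500,000 byte blocks) compression needs roughly 400k + 8 x 500k =
4400k, and decompression roughly 100k + 4 x 500k = 2100k, or 100k +
2.5 x 500k = 1350k with
.Fl s ;
these figures match the table below.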
.Pp
Larger block sizes give rapidly diminishing marginal returns.
Most of the compression comes from the first two or three hundred k of
block size, a fact worth bearing in mind when using
.Nm
on small machines.
It is also important to appreciate that the decompression memory
requirement is set at compression time by the choice of block size.
.Pp
For files compressed with the default 900k block size,
.Nm bunzip2
will require about 3700 kbytes to decompress.
To support decompression of any file on a 4 megabyte machine,
.Nm bunzip2
has an option to decompress using approximately half this amount of
memory, about 2300 kbytes.
Decompression speed is also halved, so you should use this option only
where necessary; the relevant flag is
.Fl s .
.Pp
In general, try to use the largest block size memory constraints
allow, since that maximises the compression achieved.
Compression and decompression speed are virtually unaffected by block
size.
.Pp
Another significant point applies to files which fit in a single block
-- that means most files you'd encounter using a large block size.
The amount of real memory touched is proportional to the size of the
file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
.Fl 9
will cause the compressor to allocate around 7600k of memory, but only
touch 400k + 20000 * 8 = 560 kbytes of it.
Similarly, the decompressor will allocate 3700k but only touch 100k +
20000 * 4 = 180 kbytes.
.Pp
Here is a table which summarises the maximum memory usage for different
block sizes.
Also recorded is the total compressed size for 14 files of the Calgary
Text Compression Corpus totalling 3,141,622 bytes.
This column gives some feel for how compression varies with block size.
These figures tend to understate the advantage of larger block sizes
for larger files, since the Corpus is dominated by smaller files.
.Bl -column "Flag" "Compression" "Decompression" "DecompressionXXs" "Corpus size"
.It Sy Flag Ta Sy Compression Ta Sy Decompression Ta Sy Decompression Fl s Ta Sy Corpus size
.It -1 Ta 1200k Ta 500k Ta 350k Ta 914704
.It -2 Ta 2000k Ta 900k Ta 600k Ta 877703
.It -3 Ta 2800k Ta 1300k Ta 850k Ta 860338
.It -4 Ta 3600k Ta 1700k Ta 1100k Ta 846899
.It -5 Ta 4400k Ta 2100k Ta 1350k Ta 845160
.It -6 Ta 5200k Ta 2500k Ta 1600k Ta 838626
.It -7 Ta 6100k Ta 2900k Ta 1850k Ta 834096
.It -8 Ta 6800k Ta 3300k Ta 2100k Ta 828642
.It -9 Ta 7600k Ta 3700k Ta 2350k Ta 828642
.El
.Ss RECOVERING DATA FROM DAMAGED FILES
.Nm
compresses files in blocks, usually 900 kbytes long.
Each block is handled independently.
If a media or transmission error causes a multi-block
.Pa .bz2
file to become damaged, it may be possible to recover data from the
undamaged blocks in the file.
.Pp
The compressed representation of each block is delimited by a 48-bit
pattern, which makes it possible to find the block boundaries with
reasonable certainty.
Each block also carries its own 32-bit CRC, so damaged blocks can be
distinguished from undamaged ones.
.Pp
.Nm bzip2recover
is a simple program whose purpose is to search for blocks in
.Pa .bz2
files, and write each block out into its own
.Pa .bz2
file.
You can then use
.Nm bzip2 Fl t
to test the integrity of the resulting files, and decompress those
which are undamaged.
.Pp
.Nm bzip2recover
takes a single argument, the name of the damaged file, and writes a
number of files
.Dq Pa rec00001file.bz2 ,
.Dq Pa rec00002file.bz2 ,
etc., containing the extracted blocks.
The output filenames are designed so that the use of wildcards in
subsequent processing -- for example,
.Dl bzip2 -dc rec*file.bz2 \*[Gt] recovered_data
-- processes the files in the correct order.
.Pp
.Nm bzip2recover
should be of most use dealing with large
.Pa .bz2
files, as these will contain many blocks.
It is clearly futile to use it on damaged single-block files, since a
damaged block cannot be recovered.
If you wish to minimise any potential data loss through media or
transmission errors, you might consider compressing with a smaller
block size.
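.Pp
For instance, data destined for an unreliable channel might be
compressed with the smallest block size (the file name is
illustrative):
.Dl bzip2 -1 precious_data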
.Ss PERFORMANCE NOTES
The sorting phase of compression gathers together similar strings in
the file.
Because of this, files containing very long runs of repeated
symbols, like
.Dq aabaabaabaab ...
(repeated several hundred times) may compress more slowly than normal.
Versions 0.9.5 and above fare much better than previous versions in
this respect.
The ratio between worst-case and average-case compression time is in
the region of 10:1.
For previous versions, this figure was more like 100:1.
You can use the
.Fl vvvv
option to monitor progress in great detail, if you want.
.Pp
Decompression speed is unaffected by these phenomena.
.Pp
.Nm
usually allocates several megabytes of memory to operate in, and then
charges all over it in a fairly random fashion.
This means that performance, both for compressing and decompressing,
is largely determined by the speed at which your machine can service
cache misses.
Because of this, small changes to the code to reduce the miss rate
have been observed to give disproportionately large performance
improvements.
I imagine
.Nm
will perform best on machines with very large caches.
.Sh ENVIRONMENT
.Nm
will read arguments from the environment variables
.Ev BZIP2
and
.Ev BZIP ,
in that order, and will process them before any arguments read from
the command line.
This gives a convenient way to supply default arguments.
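.Pp
For example, a default of
.Fl s
could be supplied like this from a Bourne-style shell (the file name
is illustrative):
.Dl BZIP2=-s bzip2 myfile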
.Sh EXIT STATUS
0 for a normal exit, 1 for environmental problems (file not found,
invalid flags, I/O errors, etc.), 2 to indicate a corrupt compressed
file, 3 for an internal consistency error (e.g., bug) which caused
.Nm
to panic.
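.Pp
A script might act on these values as follows (a sketch; the file
name is illustrative):
.Bd -literal -offset indent
bzip2 -t backup.bz2
case $? in
0) echo "ok" ;;
2) echo "corrupt compressed file" ;;
*) echo "other failure" ;;
esac
.Ed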
.Sh AUTHORS
.An Julian Seward
.Pa http://www.bzip.org
.Pp
The ideas embodied in
.Nm
are due to (at least) the following people:
.An Michael Burrows
and
.An David Wheeler
(for the block sorting transformation),
.An David Wheeler
(again, for the Huffman coder),
.An Peter Fenwick
(for the structured coding model in the original
.Nm bzip ,
and many refinements), and
.An Alistair Moffat ,
.An Radford Neal
and
.An Ian Witten
(for the arithmetic coder in the original
.Nm bzip ) .
I am much indebted for their help, support and advice.
See the manual in the source distribution for pointers to sources of
documentation.
Christian von Roques encouraged me to look for faster sorting
algorithms, so as to speed up compression.
Bela Lubkin encouraged me to improve the worst-case compression
performance.
Donna Robinson XMLised the documentation.
The bz* scripts are derived from those of GNU gzip.
Many people sent patches, helped with portability problems, lent
machines, gave advice and were generally helpful.
.Sh BUGS
I/O error messages are not as helpful as they could be.
.Nm
tries hard to detect I/O errors and exit cleanly, but the details of
what the problem is sometimes seem rather misleading.
.Pp
This manual page pertains to version 1.0.5 of
.Nm .
Compressed data created by this version is entirely forwards and
backwards compatible with the previous public releases, versions
0.1pl2, 0.9.0, 0.9.5, 1.0.0, 1.0.1, 1.0.2 and 1.0.3, but with the
following exception: 0.9.0 and above can correctly decompress multiple
concatenated compressed files.
0.1pl2 cannot do this; it will stop after decompressing just the first
file in the stream.
.Pp
.Nm
versions prior to 1.0.2 used 32-bit integers to represent bit
positions in compressed files, so they could not handle compressed
files more than 512 megabytes long.
Versions 1.0.2 and above use 64-bit ints on some platforms which
support them (GNU supported targets, and Windows).
To establish whether or not
.Nm
was built with such a limitation, run it without arguments.
In any event, you can build an unlimited version by recompiling with
MaybeUInt64 set to an unsigned 64-bit integer.