5 Version 1.1.0 (2022-12-11)
9 0.1. Notices and Acknowledgements
10 0.2. Getting the Latest Version
13 1.1. Byte and Its Representation
14 1.2. Multibyte Integers
15 2. Overall Structure of .xz File
18 2.1.1.1. Header Magic Bytes
23 2.1.2.2. Backward Size
25 2.1.2.4. Footer Magic Bytes
29 3.1.1. Block Header Size
31 3.1.3. Compressed Size
32 3.1.4. Uncompressed Size
33 3.1.5. List of Filter Flags
41 4.2. Number of Records
44 4.3.2. Uncompressed Size
52 5.3.2. Branch/Call/Jump Filters for Executables
54 5.3.3.1. Format of the Encoded Output
55 5.4. Custom Filter IDs
56 5.4.1. Reserved Custom Filter ID Ranges
57 6. Cyclic Redundancy Checks
63 This document describes the .xz file format (filename suffix
64 ".xz", MIME type "application/x-xz"). It is intended that this
65 this format replace the old .lzma format used by LZMA SDK and
69 0.1. Notices and Acknowledgements
71 This file format was designed by Lasse Collin
72 <lasse.collin@tukaani.org> and Igor Pavlov.
74 Special thanks for helping with this document goes to
75 Ville Koskinen. Thanks for helping with this document goes to
76 Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.
78 This document has been put into the public domain.
81 0.2. Getting the Latest Version
83 The latest official version of this document can be downloaded
84 from <http://tukaani.org/xz/xz-file-format.txt>.
86 Specific versions of this document have a filename
87 xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
88 For example, the version 1.0.0 of this document is available
89 at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>.
94 Version Date Description
96 1.1.0 2022-12-11 Added ARM64 filter and clarified 32-bit
97 ARM endianness in Section 5.3.2,
98 language improvements in Section 5.4
100 1.0.4 2009-08-27 Language improvements in Sections 1.2,
101 2.1.1.2, 3.1.1, 3.1.2, and 5.3.1
103 1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4
105 1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1
107 1.0.1 2009-06-01 Typo fix in Section 0.3 and minor
108 clarifications to Sections 2, 2.2,
111 1.0.0 2009-01-14 The first official version
116 The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
117 "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
118 document are to be interpreted as described in [RFC-2119].
120 Indicating a warning means displaying a message, returning
121 appropriate exit status, or doing something else to let the
122 user know that something worth warning occurred. The operation
123 SHOULD still finish if a warning is indicated.
125 Indicating an error means displaying a message, returning
126 appropriate exit status, or doing something else to let the
127 user know that something prevented successfully finishing the
128 operation. The operation MUST be aborted once an error has
132 1.1. Byte and Its Representation
134 In this document, byte is always 8 bits.
136 A "null byte" has all bits unset. That is, the value of a null
139 To represent byte blocks, this document uses notation that
140 is similar to the notation used in [RFC-1952]:
147 | Foo | Two bytes; that is, some of the vertical bars
148 +---+---+ can be missing.
151 | Foo | Zero or more bytes.
154 In this document, a boxed byte or a byte sequence declared
155 using this notation is called "a field". The example field
156 above would be called "the Foo field" or plain "Foo".
158 If there are many fields, they may be split to multiple lines.
159 This is indicated with an arrow ("--->"):
169 The above is equivalent to this:
176 1.2. Multibyte Integers
178 Multibyte integers of static length, such as CRC values,
179 are stored in little endian byte order (least significant
182 When smaller values are more likely than bigger values (for
183 example file sizes), multibyte integers are encoded in a
184 variable-length representation:
185 - Numbers in the range [0, 127] are copied as is, and take
187 - Bigger numbers will occupy two or more bytes. All but the
188 last byte of the multibyte representation have the highest
191 For now, the value of the variable-length integers is limited
192 to 63 bits, which limits the encoded size of the integer to
193 nine bytes. These limits may be increased in the future if
196 The following C code illustrates encoding and decoding of
197 variable-length integers. The functions return the number of
198 bytes occupied by the integer (1-9), or zero on error.
201 #include <inttypes.h>
204 encode(uint8_t buf[static 9], uint64_t num)
206 if (num > UINT64_MAX / 2)
211 while (num >= 0x80) {
212 buf[i++] = (uint8_t)(num) | 0x80;
216 buf[i++] = (uint8_t)(num);
222 decode(const uint8_t buf[], size_t size_max, uint64_t *num)
230 *num = buf[0] & 0x7F;
233 while (buf[i++] & 0x80) {
234 if (i >= size_max || buf[i] == 0x00)
237 *num |= (uint64_t)(buf[i] & 0x7F) << (i * 7);
244 2. Overall Structure of .xz File
246 A standalone .xz files consist of one or more Streams which may
247 have Stream Padding between or after them:
249 +========+================+========+================+
250 | Stream | Stream Padding | Stream | Stream Padding | ...
251 +========+================+========+================+
253 The sizes of Stream and Stream Padding are always multiples
254 of four bytes, thus the size of every valid .xz file MUST be
255 a multiple of four bytes.
257 While a typical file contains only one Stream and no Stream
258 Padding, a decoder handling standalone .xz files SHOULD support
259 files that have more than one Stream or Stream Padding.
261 In contrast to standalone .xz files, when the .xz file format
262 is used as an internal part of some other file format or
263 communication protocol, it usually is expected that the decoder
264 stops after the first Stream, and doesn't look for Stream
265 Padding or possibly other Streams.
270 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
271 | Stream Header | Block | Block | ... | Block |
272 +-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
274 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
275 ---> | Index | Stream Footer |
276 +=======+-+-+-+-+-+-+-+-+-+-+-+-+
278 All the above fields have a size that is a multiple of four. If
279 Stream is used as an internal part of another file format, it
280 is RECOMMENDED to make the Stream start at an offset that is
281 a multiple of four bytes.
283 Stream Header, Index, and Stream Footer are always present in
284 a Stream. The maximum size of the Index field is 16 GiB (2^34).
286 There are zero or more Blocks. The maximum number of Blocks is
287 limited only by the maximum size of the Index field.
289 Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
290 The same limit applies to the total amount of uncompressed
291 data stored in a Stream.
293 If an implementation supports handling .xz files with multiple
294 concatenated Streams, it MAY apply the above limits to the file
295 as a whole instead of limiting per Stream basis.
300 +---+---+---+---+---+---+-------+------+--+--+--+--+
301 | Header Magic Bytes | Stream Flags | CRC32 |
302 +---+---+---+---+---+---+-------+------+--+--+--+--+
305 2.1.1.1. Header Magic Bytes
307 The first six (6) bytes of the Stream are so called Header
308 Magic Bytes. They can be used to identify the file type.
310 Using a C array and ASCII:
311 const uint8_t HEADER_MAGIC[6]
312 = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };
314 In plain hexadecimal:
318 - The first byte (0xFD) was chosen so that the files cannot
319 be erroneously detected as being in .lzma format, in which
320 the first byte is in the range [0x00, 0xE0].
321 - The sixth byte (0x00) was chosen to prevent applications
322 from misdetecting the file as a text file.
324 If the Header Magic Bytes don't match, the decoder MUST
328 2.1.1.2. Stream Flags
330 The first byte of Stream Flags is always a null byte. In the
331 future, this byte may be used to indicate a new Stream version
332 or other Stream properties.
334 The second byte of Stream Flags is a bit field:
336 Bit(s) Mask Description
337 0-3 0x0F Type of Check (see Section 3.4):
341 0x02 4 bytes (Reserved)
342 0x03 4 bytes (Reserved)
344 0x05 8 bytes (Reserved)
345 0x06 8 bytes (Reserved)
346 0x07 16 bytes (Reserved)
347 0x08 16 bytes (Reserved)
348 0x09 16 bytes (Reserved)
349 0x0A 32 bytes SHA-256
350 0x0B 32 bytes (Reserved)
351 0x0C 32 bytes (Reserved)
352 0x0D 64 bytes (Reserved)
353 0x0E 64 bytes (Reserved)
354 0x0F 64 bytes (Reserved)
355 4-7 0xF0 Reserved for future use; MUST be zero for now.
357 Implementations SHOULD support at least the Check IDs 0x00
358 (None) and 0x01 (CRC32). Supporting other Check IDs is
359 OPTIONAL. If an unsupported Check is used, the decoder SHOULD
360 indicate a warning or error.
362 If any reserved bit is set, the decoder MUST indicate an error.
363 It is possible that there is a new field present which the
364 decoder is not aware of, and can thus parse the Stream Header
370 The CRC32 is calculated from the Stream Flags field. It is
371 stored as an unsigned 32-bit little endian integer. If the
372 calculated value does not match the stored one, the decoder
373 MUST indicate an error.
375 The idea is that Stream Flags would always be two bytes, even
376 if new features are needed. This way old decoders will be able
377 to verify the CRC32 calculated from Stream Flags, and thus
378 distinguish between corrupt files (CRC32 doesn't match) and
379 files that the decoder doesn't support (CRC32 matches but
380 Stream Flags has reserved bits set).
385 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
386 | CRC32 | Backward Size | Stream Flags | Footer Magic Bytes |
387 +-+-+-+-+---+---+---+---+-------+------+----------+---------+
392 The CRC32 is calculated from the Backward Size and Stream Flags
393 fields. It is stored as an unsigned 32-bit little endian
394 integer. If the calculated value does not match the stored one,
395 the decoder MUST indicate an error.
397 The reason to have the CRC32 field before the Backward Size and
398 Stream Flags fields is to keep the four-byte fields aligned to
399 a multiple of four bytes.
402 2.1.2.2. Backward Size
404 Backward Size is stored as a 32-bit little endian integer,
405 which indicates the size of the Index field as multiple of
406 four bytes, minimum value being four bytes:
408 real_backward_size = (stored_backward_size + 1) * 4;
410 If the stored value does not match the real size of the Index
411 field, the decoder MUST indicate an error.
413 Using a fixed-size integer to store Backward Size makes
414 it slightly simpler to parse the Stream Footer when the
415 application needs to parse the Stream backwards.
418 2.1.2.3. Stream Flags
420 This is a copy of the Stream Flags field from the Stream
421 Header. The information stored to Stream Flags is needed
422 when parsing the Stream backwards. The decoder MUST compare
423 the Stream Flags fields in both Stream Header and Stream
424 Footer, and indicate an error if they are not identical.
427 2.1.2.4. Footer Magic Bytes
429 As the last step of the decoding process, the decoder MUST
430 verify the existence of Footer Magic Bytes. If they don't
431 match, an error MUST be indicated.
433 Using a C array and ASCII:
434 const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };
439 The primary reason to have Footer Magic Bytes is to make
440 it easier to detect incomplete files quickly, without
441 uncompressing. If the file does not end with Footer Magic Bytes
442 (excluding Stream Padding described in Section 2.2), it cannot
443 be undamaged, unless someone has intentionally appended garbage
444 after the end of the Stream.
449 Only the decoders that support decoding of concatenated Streams
450 MUST support Stream Padding.
452 Stream Padding MUST contain only null bytes. To preserve the
453 four-byte alignment of consecutive Streams, the size of Stream
454 Padding MUST be a multiple of four bytes. Empty Stream Padding
455 is allowed. If these requirements are not met, the decoder MUST
458 Note that non-empty Stream Padding is allowed at the end of the
459 file; there doesn't need to be a new Stream after non-empty
460 Stream Padding. This can be convenient in certain situations
463 The possibility of Stream Padding MUST be taken into account
464 when designing an application that parses Streams backwards,
465 and the application supports concatenated Streams.
470 +==============+=================+===============+=======+
471 | Block Header | Compressed Data | Block Padding | Check |
472 +==============+=================+===============+=======+
477 +-------------------+-------------+=================+
478 | Block Header Size | Block Flags | Compressed Size |
479 +-------------------+-------------+=================+
481 +===================+======================+
482 ---> | Uncompressed Size | List of Filter Flags |
483 +===================+======================+
485 +================+--+--+--+--+
486 ---> | Header Padding | CRC32 |
487 +================+--+--+--+--+
490 3.1.1. Block Header Size
492 This field overlaps with the Index Indicator field (see
495 This field contains the size of the Block Header field,
496 including the Block Header Size field itself. Valid values are
497 in the range [0x01, 0xFF], which indicate the size of the Block
498 Header as multiples of four bytes, minimum size being eight
501 real_header_size = (encoded_header_size + 1) * 4;
503 If a Block Header bigger than 1024 bytes is needed in the
504 future, a new field can be added between the Block Header and
505 Compressed Data fields. The presence of this new field would
506 be indicated in the Block Header field.
511 The Block Flags field is a bit field:
513 Bit(s) Mask Description
514 0-1 0x03 Number of filters (1-4)
515 2-5 0x3C Reserved for future use; MUST be zero for now.
516 6 0x40 The Compressed Size field is present.
517 7 0x80 The Uncompressed Size field is present.
519 If any reserved bit is set, the decoder MUST indicate an error.
520 It is possible that there is a new field present which the
521 decoder is not aware of, and can thus parse the Block Header
525 3.1.3. Compressed Size
527 This field is present only if the appropriate bit is set in
528 the Block Flags field (see Section 3.1.2).
530 The Compressed Size field contains the size of the Compressed
531 Data field, which MUST be non-zero. Compressed Size is stored
532 using the encoding described in Section 1.2. If the Compressed
533 Size doesn't match the size of the Compressed Data field, the
534 decoder MUST indicate an error.
537 3.1.4. Uncompressed Size
539 This field is present only if the appropriate bit is set in
540 the Block Flags field (see Section 3.1.2).
542 The Uncompressed Size field contains the size of the Block
543 after uncompressing. Uncompressed Size is stored using the
544 encoding described in Section 1.2. If the Uncompressed Size
545 does not match the real uncompressed size, the decoder MUST
548 Storing the Compressed Size and Uncompressed Size fields serves
550 - The decoder knows how much memory it needs to allocate
551 for a temporary buffer in multithreaded mode.
552 - Simple error detection: wrong size indicates a broken file.
553 - Seeking forwards to a specific location in streamed mode.
555 It should be noted that the only reliable way to determine
556 the real uncompressed size is to uncompress the Block,
557 because the Block Header and Index fields may contain
558 (intentionally or unintentionally) invalid information.
561 3.1.5. List of Filter Flags
563 +================+================+ +================+
564 | Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags |
565 +================+================+ +================+
567 The number of Filter Flags fields is stored in the Block Flags
568 field (see Section 3.1.2).
570 The format of each Filter Flags field is as follows:
572 +===========+====================+===================+
573 | Filter ID | Size of Properties | Filter Properties |
574 +===========+====================+===================+
576 Both Filter ID and Size of Properties are stored using the
577 encoding described in Section 1.2. Size of Properties indicates
578 the size of the Filter Properties field as bytes. The list of
579 officially defined Filter IDs and the formats of their Filter
580 Properties are described in Section 5.3.
582 Filter IDs greater than or equal to 0x4000_0000_0000_0000
583 (2^62) are reserved for implementation-specific internal use.
584 These Filter IDs MUST never be used in List of Filter Flags.
587 3.1.6. Header Padding
589 This field contains as many null byte as it is needed to make
590 the Block Header have the size specified in Block Header Size.
591 If any of the bytes are not null bytes, the decoder MUST
592 indicate an error. It is possible that there is a new field
593 present which the decoder is not aware of, and can thus parse
594 the Block Header incorrectly.
599 The CRC32 is calculated over everything in the Block Header
600 field except the CRC32 field itself. It is stored as an
601 unsigned 32-bit little endian integer. If the calculated
602 value does not match the stored one, the decoder MUST indicate
605 By verifying the CRC32 of the Block Header before parsing the
606 actual contents allows the decoder to distinguish between
607 corrupt and unsupported files.
612 The format of Compressed Data depends on Block Flags and List
613 of Filter Flags. Excluding the descriptions of the simplest
614 filters in Section 5.3, the format of the filter-specific
615 encoded data is out of scope of this document.
620 Block Padding MUST contain 0-3 null bytes to make the size of
621 the Block a multiple of four bytes. This can be needed when
622 the size of Compressed Data is not a multiple of four. If any
623 of the bytes in Block Padding are not null bytes, the decoder
624 MUST indicate an error.
629 The type and size of the Check field depends on which bits
630 are set in the Stream Flags field (see Section 2.1.1.2).
632 The Check, when used, is calculated from the original
633 uncompressed data. If the calculated Check does not match the
634 stored one, the decoder MUST indicate an error. If the selected
635 type of Check is not supported by the decoder, it SHOULD
636 indicate a warning or error.
641 +-----------------+===================+
642 | Index Indicator | Number of Records |
643 +-----------------+===================+
645 +=================+===============+-+-+-+-+
646 ---> | List of Records | Index Padding | CRC32 |
647 +=================+===============+-+-+-+-+
649 Index serves several purposes. Using it, one can
650 - verify that all Blocks in a Stream have been processed;
651 - find out the uncompressed size of a Stream; and
652 - quickly access the beginning of any Block (random access).
657 This field overlaps with the Block Header Size field (see
658 Section 3.1.1). The value of Index Indicator is always 0x00.
661 4.2. Number of Records
663 This field indicates how many Records there are in the List
664 of Records field, and thus how many Blocks there are in the
665 Stream. The value is stored using the encoding described in
666 Section 1.2. If the decoder has decoded all the Blocks of the
667 Stream, and then notices that the Number of Records doesn't
668 match the real number of Blocks, the decoder MUST indicate an
674 List of Records consists of as many Records as indicated by the
675 Number of Records field:
678 | Record | Record | ...
681 Each Record contains information about one Block:
683 +===============+===================+
684 | Unpadded Size | Uncompressed Size |
685 +===============+===================+
687 If the decoder has decoded all the Blocks of the Stream, it
688 MUST verify that the contents of the Records match the real
689 Unpadded Size and Uncompressed Size of the respective Blocks.
691 Implementation hint: It is possible to verify the Index with
692 constant memory usage by calculating for example SHA-256 of
693 both the real size values and the List of Records, then
694 comparing the hash values. Implementing this using
695 non-cryptographic hash like CRC32 SHOULD be avoided unless
696 small code size is important.
698 If the decoder supports random-access reading, it MUST verify
699 that Unpadded Size and Uncompressed Size of every completely
700 decoded Block match the sizes stored in the Index. If only
701 partial Block is decoded, the decoder MUST verify that the
702 processed sizes don't exceed the sizes stored in the Index.
707 This field indicates the size of the Block excluding the Block
708 Padding field. That is, Unpadded Size is the size of the Block
709 Header, Compressed Data, and Check fields. Unpadded Size is
710 stored using the encoding described in Section 1.2. The value
711 MUST never be zero; with the current structure of Blocks, the
712 actual minimum value for Unpadded Size is five.
714 Implementation note: Because the size of the Block Padding
715 field is not included in Unpadded Size, calculating the total
716 size of a Stream or doing random-access reading requires
717 calculating the actual size of the Blocks by rounding Unpadded
718 Sizes up to the next multiple of four.
720 The reason to exclude Block Padding from Unpadded Size is to
721 ease making a raw copy of Compressed Data without Block
722 Padding. This can be useful, for example, if someone wants
723 to convert Streams to some other file format quickly.
726 4.3.2. Uncompressed Size
728 This field indicates the Uncompressed Size of the respective
729 Block as bytes. The value is stored using the encoding
730 described in Section 1.2.
735 This field MUST contain 0-3 null bytes to pad the Index to
736 a multiple of four bytes. If any of the bytes are not null
737 bytes, the decoder MUST indicate an error.
742 The CRC32 is calculated over everything in the Index field
743 except the CRC32 field itself. The CRC32 is stored as an
744 unsigned 32-bit little endian integer. If the calculated
745 value does not match the stored one, the decoder MUST indicate
751 The Block Flags field defines how many filters are used. When
752 more than one filter is used, the filters are chained; that is,
753 the output of one filter is the input of another filter. The
754 following figure illustrates the direction of data flow.
756 v Uncompressed Data ^
758 Encoder | Filter 1 | Decoder
765 Alignment of uncompressed input data is usually the job of
766 the application producing the data. For example, to get the
767 best results, an archiver tool should make sure that all
768 PowerPC executable files in the archive stream start at
769 offsets that are multiples of four bytes.
771 Some filters, for example LZMA2, can be configured to take
772 advantage of specified alignment of input data. Note that
773 taking advantage of aligned input can be beneficial also when
774 a filter is not the first filter in the chain. For example,
775 if you compress PowerPC executables, you may want to use the
776 PowerPC filter and chain that with the LZMA2 filter. Because
777 not only the input but also the output alignment of the PowerPC
778 filter is four bytes, it is now beneficial to set LZMA2
779 settings so that the LZMA2 encoder can take advantage of its
780 four-byte-aligned input data.
782 The output of the last filter in the chain is stored to the
783 Compressed Data field, which is is guaranteed to be aligned
784 to a multiple of four bytes relative to the beginning of the
785 Stream. This can increase
786 - speed, if the filtered data is handled multiple bytes at
787 a time by the filter-specific encoder and decoder,
788 because accessing aligned data in computer memory is
790 - compression ratio, if the output data is later compressed
791 with an external compression tool.
796 If filters would be allowed to be chained freely, it would be
797 possible to create malicious files, that would be very slow to
798 decode. Such files could be used to create denial of service
801 Slow files could occur when multiple filters are chained:
803 v Compressed input data
804 | Filter 1 decoder (last filter)
805 | Filter 0 decoder (non-last filter)
806 v Uncompressed output data
808 The decoder of the last filter in the chain produces a lot of
809 output from little input. Another filter in the chain takes the
810 output of the last filter, and produces very little output
811 while consuming a lot of input. As a result, a lot of data is
812 moved inside the filter chain, but the filter chain as a whole
813 gets very little work done.
815 To prevent this kind of slow files, there are restrictions on
816 how the filters can be chained. These restrictions MUST be
817 taken into account when designing new filters.
819 The maximum number of filters in the chain has been limited to
820 four, thus there can be at maximum of three non-last filters.
821 Of these three non-last filters, only two are allowed to change
822 the size of the data.
824 The non-last filters, that change the size of the data, MUST
825 have a limit how much the decoder can compress the data: the
826 decoder SHOULD produce at least n bytes of output when the
827 filter is given 2n bytes of input. This limit is not
828 absolute, but significant deviations MUST be avoided.
830 The above limitations guarantee that if the last filter in the
831 chain produces 4n bytes of output, the chain as a whole will
832 produce at least n bytes of output.
839 LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
840 compression algorithm with high compression ratio and fast
841 decompression. LZMA is based on LZ77 and range coding
844 LZMA2 is an extension on top of the original LZMA. LZMA2 uses
845 LZMA internally, but adds support for flushing the encoder,
846 uncompressed chunks, eases stateful decoder implementations,
847 and improves support for multithreading. Thus, the plain LZMA
848 will not be supported in this file format.
851 Size of Filter Properties: 1 byte
852 Changes size of data: Yes
853 Allow as a non-last filter: No
854 Allow as the last filter: Yes
857 Input data: Adjustable to 1/2/4/8/16 byte(s)
860 The format of the one-byte Filter Properties field is as
863 Bits Mask Description
864 0-5 0x3F Dictionary Size
865 6-7 0xC0 Reserved for future use; MUST be zero for now.
867 Dictionary Size is encoded with one-bit mantissa and five-bit
868 exponent. The smallest dictionary size is 4 KiB and the biggest
871 Raw value Mantissa Exponent Dictionary size
885 40 2 31 4096 MiB - 1 B
887 Instead of having a table in the decoder, the dictionary size
888 can be decoded using the following C code:
890 const uint8_t bits = get_dictionary_flags() & 0x3F;
892 return DICTIONARY_TOO_BIG; // Bigger than 4 GiB
894 uint32_t dictionary_size;
896 dictionary_size = UINT32_MAX;
898 dictionary_size = 2 | (bits & 1);
899 dictionary_size <<= bits / 2 + 11;
903 5.3.2. Branch/Call/Jump Filters for Executables
905 These filters convert relative branch, call, and jump
906 instructions to their absolute counterparts in executable
907 files. This conversion increases redundancy and thus
910 Size of Filter Properties: 0 or 4 bytes
911 Changes size of data: No
912 Allow as a non-last filter: Yes
913 Allow as the last filter: No
915 Below is the list of filters in this category. The alignment
916 is the same for both input and output data.
918 Filter ID Alignment Description
919 0x04 1 byte x86 filter (BCJ)
920 0x05 4 bytes PowerPC (big endian) filter
921 0x06 16 bytes IA64 filter
922 0x07 4 bytes ARM filter [1]
923 0x08 2 bytes ARM Thumb filter [1]
924 0x09 4 bytes SPARC filter
925 0x0A 4 bytes ARM64 filter [2]
927 [1] These are for little endian instruction encoding.
928 This must not be confused with data endianness.
929 A processor configured for big endian data access
930 may still use little endian instruction encoding.
931 The filters don't care about the data endianness.
933 [2] 4096-byte alignment gives the best results
934 because the address in the ADRP instruction
935 is a multiple of 4096 bytes.
937 If the size of Filter Properties is four bytes, the Filter
938 Properties field contains the start offset used for address
939 conversions. It is stored as an unsigned 32-bit little endian
940 integer. The start offset MUST be a multiple of the alignment
941 of the filter as listed in the table above; if it isn't, the
942 decoder MUST indicate an error. If the size of Filter
943 Properties is zero, the start offset is zero.
945 Setting the start offset may be useful if an executable has
946 multiple sections, and there are many cross-section calls.
947 Taking advantage of this feature usually requires usage of
948 the Subblock filter, whose design is not complete yet.
953 The Delta filter may increase compression ratio when the value
954 of the next byte correlates with the value of an earlier byte
955 at specified distance.
958 Size of Filter Properties: 1 byte
959 Changes size of data: No
960 Allow as a non-last filter: Yes
961 Allow as the last filter: No
965 Output data: Same as the original input data
967 The Properties byte indicates the delta distance, which can be
968 1-256 bytes backwards from the current byte: 0x00 indicates
969 distance of 1 byte and 0xFF distance of 256 bytes.
972 5.3.3.1. Format of the Encoded Output
974 The code below illustrates both encoding and decoding with
977 // Distance is in the range [1, 256].
978 const unsigned int distance = get_properties_byte() + 1;
982 memset(delta, 0, sizeof(delta));
985 const int byte = read_byte();
989 uint8_t tmp = delta[(uint8_t)(distance + pos)];
991 tmp = (uint8_t)(byte) - tmp;
992 delta[pos] = (uint8_t)(byte);
994 tmp = (uint8_t)(byte) + tmp;
1003 5.4. Custom Filter IDs
1005 If a developer wants to use custom Filter IDs, there are two
1006 choices. The first choice is to contact Lasse Collin and ask
1007 him to allocate a range of IDs for the developer.
1009 The second choice is to generate a 40-bit random integer
1010 which the developer can use as a personal Developer ID.
1011 To minimize the risk of collisions, Developer ID has to be
1012 a randomly generated integer, not manually selected "hex word".
1013 The following command, which works on many free operating
1014 systems, can be used to generate Developer ID:
1016 dd if=/dev/urandom bs=5 count=1 | hexdump
1018 The developer can then use the Developer ID to create unique
1019 (well, hopefully unique) Filter IDs.
1021 Bits Mask Description
1022 0-15 0x0000_0000_0000_FFFF Filter ID
1023 16-55 0x00FF_FFFF_FFFF_0000 Developer ID
1024 56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F
1026 The resulting 63-bit integer will use 9 bytes of space when
1027 stored using the encoding described in Section 1.2. To get
1028 a shorter ID, see the beginning of this Section how to
1029 request a custom ID range.
1032 5.4.1. Reserved Custom Filter ID Ranges
1035 0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility
1036 0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
1037 0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility
1040 6. Cyclic Redundancy Checks
1042 There are several incompatible variations to calculate CRC32
1043 and CRC64. For simplicity and clarity, complete examples are
1044 provided to calculate the checks as they are used in this file
1045 format. Implementations MAY use different code as long as it
1046 gives identical results.
1048 The program below reads data from standard input, calculates
1049 the CRC32 and CRC64 values, and prints the calculated values
1050 as big endian hexadecimal strings to standard output.
1053 #include <inttypes.h>
1056 uint32_t crc32_table[256];
1057 uint64_t crc64_table[256];
1062 static const uint32_t poly32 = UINT32_C(0xEDB88320);
1063 static const uint64_t poly64
1064 = UINT64_C(0xC96C5795D7870F42);
1066 for (size_t i = 0; i < 256; ++i) {
1070 for (size_t j = 0; j < 8; ++j) {
1072 crc32 = (crc32 >> 1) ^ poly32;
1077 crc64 = (crc64 >> 1) ^ poly64;
1082 crc32_table[i] = crc32;
1083 crc64_table[i] = crc64;
1088 crc32(const uint8_t *buf, size_t size, uint32_t crc)
1091 for (size_t i = 0; i < size; ++i)
1092 crc = crc32_table[buf[i] ^ (crc & 0xFF)]
1098 crc64(const uint8_t *buf, size_t size, uint64_t crc)
1101 for (size_t i = 0; i < size; ++i)
1102 crc = crc64_table[buf[i] ^ (crc & 0xFF)]
1112 uint32_t value32 = 0;
1113 uint64_t value64 = 0;
1114 uint64_t total_size = 0;
1118 const size_t buf_size
1119 = fread(buf, 1, sizeof(buf), stdin);
1123 total_size += buf_size;
1124 value32 = crc32(buf, buf_size, value32);
1125 value64 = crc64(buf, buf_size, value64);
1128 printf("Bytes: %" PRIu64 "\n", total_size);
1129 printf("CRC-32: 0x%08" PRIX32 "\n", value32);
1130 printf("CRC-64: 0x%016" PRIX64 "\n", value64);
1138 LZMA SDK - The original LZMA implementation
1139 http://7-zip.org/sdk.html
1141 LZMA Utils - LZMA adapted to POSIX-like systems
1142 http://tukaani.org/lzma/
1144 XZ Utils - The next generation of LZMA Utils
1145 http://tukaani.org/xz/
1148 GZIP file format specification version 4.3
1149 http://www.ietf.org/rfc/rfc1952.txt
1150 - Notation of byte boxes in section "2.1. Overall conventions"
1153 Key words for use in RFCs to Indicate Requirement Levels
1154 http://www.ietf.org/rfc/rfc2119.txt
1158 http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
1159 - Node 9.4.2 "Blocking Factor", paragraph that begins
1160 "gzip will complain about trailing garbage"
1161 - Note that this URL points to the latest version of the
1162 manual, and may some day not contain the note which is in
1163 1.21. For the exact version of the manual, download GNU
1164 tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz