contrib/compressor/algorithm.doc

   1 The  compressor achieves  an  average compression  rate of 60%  of the
   2 original size which is on par with "gzip". It seems that you cannot do
   3 much better for compressing  compiled  binaries.  This means that  the
   4 break even  point  for using compressed  images is   reached, once the
   5 uncompressed size approaches 1.5kB. We  can stuff more than 12kB  into
   6 an 8kB EPROM and more than 25kB into an 16kB EPROM.   As there is only
   7 32kB of RAM  for both the uncompressed  image  and its BSS  area, this
   8 means that 32kB EPROMs will hardly ever be required.
   9
  10 The compression  algorithm uses a  4kB  ring buffer  for buffering the
  11 uncompressed data. Before   compression starts,  the  ring buffer   is
  12 filled  with spaces (ASCII  character  0x20).  The algorithm tries  to
  13 find repeated  input sequences of a  maximum length of  60 bytes.  All
  14 256 different input  bytes  plus the 58 (60   minus a threshold of  2)
  15 possible  repeat lengths form a set  of 314 symbols. These symbols are
  16 adaptively Huffman encoded.  The  algorithm starts out with a Huffmann
  17 tree  that  assigns equal code lengths    to each of  the  314 symbols
  18 (slightly favoring the repeat  symbols over symbols for regular  input
  19 characters), but  it will be changed whenever  the frequency of any of
  20 the symbols  changes. Frequency counts are  kept in 16bit  words until
  21 the total number of compressed codes totals 2^15.  Then, all frequency
  22 counts will be halfed (rounding to the bigger number).  For unrepeated
  23 characters (symbols 0..255) the Huffman code  is written to the output
  24 stream.  For repeated characters the  Huffmann code, which denotes the
  25 length of the repeated character sequence, is written out and then the
  26 index in the ring buffer is computed.   From this index, the algorithm
  27 computes  the offset   relative to  the current  index  into  the ring
  28 buffer. Thus,  for typical input data,  one would expect that short to
  29 medium range offsets are more frequent  than extremely short or medium
  30 range to long range offsets. Thus the  12bit (for a 4kB buffer) offset
  31 value  is statically Huffman encoded  using a precomputed Huffman tree
  32 that favors  those  offset  values    that  are deemed to   be    more
  33 frequent. The  Huffman encoded offset  is  written to the output  data
  34 stream,  directly  following the code  that   determines the length of
  35 repeated characters.
  36
  37 This algorithm, as implemented in the  C example code, looks very good
  38 and  its operating parameters are   already well optimized. This  also
  39 explains   why  it achieves     compression ratios    comparable  with
  40 "gzip". Depending on the input data, it sometimes excells considerably
  41 beyond what "gzip -9" does, but this  phenomenon does not appear to be
  42 typical. There are some flaws with  the algorithm, such as the limited
  43 buffer  sizes, the  adaptive  Huffman tree  which takes  very  long to
  44 change, if    the input  characters  experience   a sudden   change in
  45 distribution, and the static Huffman   tree for encoding offsets  into
  46 the  buffer.   The slow  changes of   the  adaptive  Huffman  tree are
  47 partially counteracted by  artifically keeping  a 16bit precision  for
  48 the frequency counts, but  this does not  come into play until 32kB of
  49 compressed data is output, so  it does not  have any impact on our use
  50 for "etherboot", because  the BOOT Prom  does not support uncompressed
  51 data of more then 32kB (c.f. doc/spec.doc).
  52
  53 Nonetheless,  these problems  do  not  seem  to affect  compression of
  54 compiled  programs very much.  Mixing  object code with English  text,
  55 would not work too  well though, and  the algorithm should be reset in
  56 between. Actually, we  might  gain a little  improvement, if  text and
  57 data   segments    were compressed  individually,    but   I have  not
  58 experimented with this option, yet.