BSByteStream.h

Simple Burrows-Wheeler general purpose compressor.

BSByteStream -- Performs bzz compression/decompression.
Files "BSByteStream.h" and "BSByteStream.cpp" implement a very compact general-purpose compressor based on the Burrows-Wheeler transform. The utility program bzz provides a front end for this class. Although this compression model is not currently used in DjVu files, it may be used in the future for encoding textual data chunks.

Algorithms --- The Burrows-Wheeler transform (also called block sorting) is performed using a combination of the Karp-Miller-Rosenberg and the Bentley-Sedgewick algorithms. This approach is comparable to that of Sadakane (DCC 98), with a slightly more flexible ranking scheme. Symbols are then ordered according to a running estimate of their occurrence frequencies. The resulting symbol ranks are coded using a simple fixed tree and the ZPCodec binary adaptive coder.
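
To make the block-sorting step concrete, here is a minimal sketch of the transform and its inverse. It is not the code in "BSByteStream.cpp": it sorts rotations with a plain comparison sort instead of the Karp-Miller-Rosenberg and Bentley-Sedgewick combination, and it omits the ranking and ZPCodec stages entirely. The names bwt_encode and bwt_decode are hypothetical and chosen for this illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Forward transform (illustration only): sort all rotations of `s` and keep
// the last column. Returns that column plus the row holding the original text.
std::pair<std::string, std::size_t> bwt_encode(const std::string &s) {
    const std::size_t n = s.size();
    std::vector<std::size_t> rot(n);          // rot[i] = start offset of a rotation
    std::iota(rot.begin(), rot.end(), 0);
    std::stable_sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) { // naive: up to n steps per comparison
            unsigned char ca = s[(a + k) % n];
            unsigned char cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                         // identical rotations
    });
    std::string last(n, '\0');
    std::size_t primary = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (rot[i] == 0) primary = i;         // row of the unrotated input
        last[i] = s[(rot[i] + n - 1) % n];    // symbol preceding the rotation start
    }
    return {last, primary};
}

// Inverse transform: rebuild the text back to front by following the
// last-to-first mapping, starting from row `primary`.
std::string bwt_decode(const std::string &last, std::size_t primary) {
    const std::size_t n = last.size();
    assert(n == 0 || primary < n);
    std::vector<std::size_t> start(257, 0);   // start[c] = count of symbols < c
    for (unsigned char c : last) ++start[c + 1];
    for (int c = 0; c < 256; ++c) start[c + 1] += start[c];
    std::vector<std::size_t> lf(n);           // lf[i] = row reached from row i
    for (std::size_t i = 0; i < n; ++i)
        lf[i] = start[static_cast<unsigned char>(last[i])]++;
    std::string out(n, '\0');
    std::size_t row = primary;
    for (std::size_t k = n; k-- > 0;) {
        out[k] = last[row];
        row = lf[row];
    }
    return out;
}
```

With this sketch, bwt_encode("banana") yields the column "nnbaaa" and row index 3, and bwt_decode("nnbaaa", 3) returns "banana". The comparator walks up to n symbols per comparison, so highly repetitive inputs make this naive version very slow; avoiding that behaviour is the reason for using more robust sorting algorithms.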

Performance --- The basic algorithm is similar to those implemented in well-known compressors like bzip or bzip2 (http://www.muraroa.demon.co.uk). The adaptive binary coder, however, produces small differences. The adaptation noise may cost up to 5% in file size, but this penalty is usually offset by the benefits of adaptation. Adaptation pays off when processing large and highly structured files such as spreadsheet files. Compression and decompression run at about half the speed of bzip2, but the sorting algorithm is more robust. Unlike bzip2 (as of August 1998), this code can compress half a megabyte of "abababab...." in bounded time.

Here are some comparative results (in bits per character) obtained on the Canterbury Corpus (http://corpus.canterbury.ac.nz) as of August 1998. The BSByteStream performance on the single spreadsheet file Excl moves bzz's weighted average ahead of much more sophisticated methods, like Suzanne Bunton's fsmxBest system (http://corpus.canterbury.ac.nz/methodinfo/fsmx.html). This result will not last very long.

           text   fax  Csrc  Excl  SPRC  tech  poem  html  lisp   man  play  Weighted  Average
compress   3.27  0.97  3.56  2.41  4.21  3.06  3.38  3.68  3.90  4.43  3.51      2.55     3.31
gzip -9    2.85  0.82  2.24  1.63  2.67  2.71  3.23  2.59  2.65  3.31  3.12      2.08     2.53
bzip2 -9   2.27  0.78  2.18  1.01  2.70  2.02  2.42  2.48  2.79  3.33  2.53      1.54     2.23
ppmd       2.31  0.99  2.11  1.08  2.68  2.19  2.48  2.38  2.43  3.00  2.53      1.65     2.20
fsmx       2.10  0.79  1.89  1.48  2.52  1.84  2.21  2.24  2.29  2.91  2.35      1.63     2.06
bzz        2.25  0.76  2.13  0.78  2.67  2.00  2.40  2.52  2.60  3.19  2.52      1.44     2.16
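
For readers unfamiliar with the unit, bits per character is the compressed size in bits divided by the original size in bytes, so an uncompressed file scores 8.00. The helper below is a small sketch of that arithmetic; the function name bits_per_char and the file sizes in the usage note are hypothetical and are not taken from the corpus measurements.

```cpp
#include <cassert>
#include <cstddef>

// Bits per character: 8 * compressed bytes / original bytes.
// Hypothetical helper for illustration; requires a non-empty original file.
double bits_per_char(std::size_t original_bytes, std::size_t compressed_bytes) {
    assert(original_bytes > 0);
    return 8.0 * static_cast<double>(compressed_bytes) /
           static_cast<double>(original_bytes);
}
```

For instance, a hypothetical 100000-byte file that compresses to 9750 bytes scores 0.78 bits per character, which means the output is a little under 10% of the input size.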

Note that the DjVu people have several entries in this table. Program compress was written some time ago by Joe Orost (http://www.research.att.com/info/orost). The ppmc method (a precursor of ppmd) was created by Paul Howard (http://www.research.att.com/info/pgh). The bzz program is right before your eyes.

Author:
Léon Bottou <leonb@research.att.com> -- Initial implementation
Andrei Erofeev <eaf@research.att.com> -- Improved Block Sorting algorithm.
Version:
$Id: BSByteStream.h.html,v 1.2 2000/08/26 00:09:29 bcr Exp $
