
Data Compression with the Burrows-Wheeler Transform

by Mark Nelson

Dr. Dobb's Journal

September, 1996

This page contains my original text and figures for the article that appeared in the September 1996 DDJ. I haven't broken it up into pages, so loading the entire thing might take some time.



The world is not yet exhausted; let me see something tomorrow which I never saw before. -- Samuel Johnson

In all chaos there is a cosmos, in all disorder a secret order. -- Carl Jung

Introduction

In mathematics, difficult problems can often be simplified by performing a transformation on a data set. For example, digital signal processing programs use the FFT to convert sets of sampled audio data points to equivalent sets of frequency data. Pumping up the bass or applying a notch filter is then just a matter of multiplying some frequency data points by a scaling factor. Perform an inverse FFT on the resulting points, and voila, you have a new audio waveform that has been transformed according to your specifications.

Michael Burrows and David Wheeler recently released the details of a transformation function that opens the door to some revolutionary new data compression techniques. The Burrows-Wheeler Transform, or BWT, transforms a block of data into a format that is extremely well suited for compression. It does such a good job at this that even the simple demonstration programs I'll present here will outperform state-of-the-art programs.

In this article, I'll first present the BWT, and demonstrate its reversibility. I'll then show you how a block of data transformed by the BWT can be compressed using standard techniques. After discussing the advantages and disadvantages of this type of data compression, I'll present my simple demonstration program. And no article on data compression would be complete without the pro forma comparison table, showing how BWT-based compression stacks up against other modern techniques.

The Burrows-Wheeler Transform

Michael Burrows and David Wheeler released a research report in 1994 discussing work they had been doing at the Digital Systems Research Center in Palo Alto, California. Their paper, "A Block-sorting Lossless Data Compression Algorithm" presented a data compression algorithm based on a previously unpublished transformation discovered by Wheeler in 1983.

While the paper discusses a complete set of algorithms for compression and decompression, the real heart of the paper consists of the disclosure of the BWT algorithm.

BWT Basics

The BWT is an algorithm that takes a block of data and rearranges it using a sorting algorithm. The resulting output block contains exactly the same data elements that it started with, differing only in their ordering. The transformation is reversible, meaning the original ordering of the data elements can be restored with no loss of fidelity.

The BWT is performed on an entire block of data at once. Most of today's familiar lossless compression algorithms operate in streaming mode, reading a single byte or a few bytes at a time. But with this new transform, we want to operate on the largest chunks of data possible. Since the BWT operates on data in memory, you may encounter files too big to process in one fell swoop. In these cases, the file must be split up and processed a block at a time. The demo programs that accompany this article work comfortably with block sizes of 50 Kbytes up to 250 Kbytes.

Figure 1

A sample data set

For purposes of illustration, I'll start with a much smaller data set, shown in Figure 1. This string contains seven bytes of data. In order to perform the BWT, the first thing we do is treat a string S, of length N, as if it actually contains N different strings, with each character in the original string being the start of a specific string that is N bytes long. (In this case, the word string doesn't have the C/C++ semantics of being a null terminated set of characters. A string is just a collection of bytes.) We also treat the buffer as if the last character wraps around back to the first.

Figure 2

The set of strings associated with the buffer

It's important to remember at this point that we don't actually make N - 1 rotated copies of the input string. In the demonstration program, we just represent each of the strings by a pointer or an index into a memory buffer.

The next step in the BWT is to perform a lexicographical sort on the set of input strings. That is, we want to order the strings using a fixed comparison function. This means we should compare using a function similar to the C functions strcmp() or memcmp(). In this high-level view of the algorithm, the comparison function has to be able to wrap around when it reaches the end of the buffer, so a slightly modified comparison function would be needed.
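
To make this concrete, here is a minimal sketch of the sort. This is illustration code, not the demo suite's actual implementation: each string is represented by its starting index, and the comparison function wraps around the end of the buffer.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// A minimal sketch: sort the rotated strings of "DRDOBBS" without
// copying them, by sorting starting indexes with a comparison
// function that wraps around the end of the buffer.
int main()
{
    const std::string s = "DRDOBBS";
    const int n = (int) s.size();

    std::vector<int> rot( n );         // rot[ i ] = starting index of a string
    for ( int i = 0 ; i < n ; i++ )
        rot[ i ] = i;

    std::sort( rot.begin(), rot.end(),
               [&]( int a, int b ) {
                   for ( int k = 0 ; k < n ; k++ ) {  // compare with wraparound
                       char ca = s[ ( a + k ) % n ];
                       char cb = s[ ( b + k ) % n ];
                       if ( ca != cb )
                           return ca < cb;
                   }
                   return false;                      // the strings are equal
               } );

    for ( int i = 0 ; i < n ; i++ )    // prints the row order of Figure 3
        std::cout << "row " << i << " holds S" << rot[ i ] << "\n";
    return 0;
}

Running this prints rows 0 through 6 holding S4, S5, S2, S0, S3, S1, and S6, which is exactly the arrangement shown in Figure 3.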

After sorting, the set of strings is arranged as shown in Figure 3. There are two important points to note in this picture. First, the strings have been sorted, but we've kept track of which string occupied which position in the original set. So, we know that String 0, the original unsorted string, has now moved down to row 3 in the array.

Figure 3

The set of strings after sorting

Second, I've tagged the first and last columns in the matrix with the special designations F and L, for the first and last columns of the array. Column F contains all the characters in the original string in sorted order. So our original string "DRDOBBS" is represented in F as "BBDDORS".

The characters in column L don't appear to be in any particular order, but in fact they have an interesting property. Each of the characters in L is the prefix character to the string that starts in the same row in column F.

The actual output of the BWT, oddly enough, consists of two things: a copy of column L, and the primary index, an integer indicating which row contains the original first character of the input buffer. So performing the BWT on our original string generates the output string L, which contains "OBRSDDB", and a primary index of 5.

The integer 5 is found easily enough since the original first character of the buffer will always be found in column L in the row that contains S1. Since S1 is simply S0 rotated left by a single character position, the very first character of the buffer is rotated into the last column of the matrix. Therefore, locating S1 is equivalent to locating the buffer's first character position in L.
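
In code, and continuing the sketch above, producing the output is just a matter of reading off each row's prefix character and noting which row received S1:

// Continuing the previous sketch: emit column L and the primary index,
// assuming rot[] already holds the sorted order of the starting indexes.
void emit_bwt( const std::string & s, const std::vector<int> & rot )
{
    const int n = (int) s.size();
    std::string L( n, ' ' );
    int primary_index = -1;
    for ( int i = 0 ; i < n ; i++ ) {
        L[ i ] = s[ ( rot[ i ] + n - 1 ) % n ];   // prefix character of row i
        if ( rot[ i ] == 1 )                      // this row holds S1
            primary_index = i;
    }
    std::cout << L << "\n"                        // prints OBRSDDB
              << primary_index << "\n";           // and 5
}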

Two obvious questions

At this point in the exposition, there are two obvious questions. First, it doesn't seem possible that this is a reversible transformation. Generally, a sort() function doesn't come with an unsort() partner that can restore your original ordering. In fact, it isn't likely that you've ever even considered this as something you might like. And second, what possible good does this strange transformation do you?

I'll defer the answer to the second question while I explain the reversibility of this transform. Unsorting column L requires the use of something called the transformation vector. The transformation vector is an array that defines the order in which the rotated strings are scattered throughout the rows of the matrix of Figure 3.

The transformation vector, T, is an array with one index for each row in column F. If row i contains S[ j ], then T[ i ] is defined as the row where S[ j + 1 ] is found. In Figure 3, row 3 contains S0, the original input string, and row 5 contains S1, the string rotated one character to the left. Thus, T[ 3 ] contains the value 5. Likewise, S2 is found in row 2, so T[ 5 ] contains a 2. For this particular matrix, the transformation vector works out to { 1, 6, 4, 5, 0, 2, 3 }.

Figure 4

The transformation vector routes S[ i ] to S[ i + 1]

Figure 4 shows how the transformation vector is used to walk through the various rows. For any row that contains S[ i ], the vector provides the value of the row where S[ i + 1 ] is found.

The Rosetta Vector

The reason the transformation vector is so important is that it provides the key to restoring L to its original order. Given L and the primary index, you can restore the original S0. For this example, the following code does the job:

#include <iostream>
using namespace std;

int T[] = { 1, 6, 4, 5, 0, 2, 3 };   // the transformation vector
char L[] = "OBRSDDB";                // the BWT output column
int primary_index = 5;               // the row that holds S1

void decode()
{
    int index = primary_index;
    for ( int i = 0 ; i < 7 ; i++ ) {
        cout << L[ index ];          // emit the prefix character of this row
        index = T[ index ];          // hop to the row holding the next rotation
    }
}

int main()
{
    decode();                        // prints DRDOBBS
    cout << "\n";
    return 0;
}

You can get there from here

So now we come to the core premise of the Burrows-Wheeler transform: given a copy of L, you can calculate the transformation vector for the original input matrix. And consequently, given the primary index, you can recreate S0, or the input string.

The key that makes this possible is that calculating the transformation vector requires only that you know the contents of the first and last columns of the matrix. And believe it or not, simply having a copy of L means that you do, in fact, have a copy of F as well.

Figure 5

The known state of the matrix given L

Given just the copy of L, we don't know much about the state of the matrix. Figure 5 shows L, which I've moved into the first column for purposes of illustration. In this figure, F is going to be in the next column. And fortunately for us, F has an important characteristic: it contains all of the characters from the input string in sorted order. Since L also contains all the same characters, we can determine the contents of F by simply sorting L!

Figure 6

The known state after recovering F

Now we can start working on calculating T. The character 'O' in row 0 clearly moves to row 4 in column F, which means T[ 4 ] = 0. But what about row 1? The 'B' could match up with either the 'B' in row 0 or row 1. Which do we select?

Fortunately, the choice here is not ambiguous, although the decision making process may not be intuitive. Remember that by definition, column F is sorted. This means that the strings beginning with 'B' appear in sorted order among the rows of F. But the same is true of the strings whose prefix character in column L is 'B'. Why? They all start with the same character, and they are sorted on everything that follows, because what follows each 'B' is exactly the row it sits on, and the rows are already sorted.

Since the strings in F must, by definition, appear in sorted order, all the strings that start with a common character appear in the same relative order in L as they do in F, although not necessarily in the same rows. Because of this, we know that the 'B' in row 1 of L is going to move up to row 0 in F. The 'B' in row 6 of L moves to row 1 of F.

Once that difficulty is cleared, it's a simple matter to recover the transformation vector from the two columns. And once that is done, recovering the original input string is short work as well. Simply applying the C++ code shown earlier does the job.
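
Here is a minimal sketch of that recovery, again illustration code rather than the real UNBWT: F is recovered by sorting L, and then the k-th occurrence of each character in F is matched with the k-th occurrence of that character in L. The quadratic scan keeps the idea visible; a production version would typically use occurrence counts to build T in a single pass.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// A minimal sketch: recover the transformation vector from L alone.
int main()
{
    const std::string L = "OBRSDDB";
    const int n = (int) L.size();

    std::string F = L;
    std::sort( F.begin(), F.end() );        // recover column F by sorting L

    std::vector<int> T( n );
    std::vector<bool> used( n, false );     // rows of L already matched
    for ( int i = 0 ; i < n ; i++ ) {       // for each row of F...
        for ( int j = 0 ; j < n ; j++ ) {   // ...find the next unused match in L
            if ( !used[ j ] && L[ j ] == F[ i ] ) {
                T[ i ] = j;
                used[ j ] = true;
                break;
            }
        }
    }

    for ( int i = 0 ; i < n ; i++ )
        std::cout << T[ i ] << " ";         // prints 1 6 4 5 0 2 3
    std::cout << "\n";
    return 0;
}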

The Punchline

The whole point of this BWT exercise is supposed to be better data compression, so let's take a look at how this works. Figure 7 shows a snippet of debug output taken from the test program BWT.CPP. Each line of output shows a prefix character followed by a sorted string. The prefix character is actually what is found in column L after the strings have been sorted.

t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't  likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled

Figure 7

Debug output from BWT.CPP

This section of output consists of a group of sorted strings that all start with the characters "hat". Not surprisingly, the prefix character for almost all of these strings is a lower case 't'. The only exceptions here are a 'W' and a 'w'.

Anytime you see repetition like this, you clearly have an opportunity for data compression. Sorting all of the strings gives rise to a set of prefix characters that has lots of little clumps where characters repeat.

Using the opportunity

We could take the output of the BWT and just apply a conventional compressor to it, but Burrows and Wheeler suggest an improved approach. They recommend using a Move to Front scheme, followed by an entropy encoder.

A Move to Front (or MTF) encoder is a fairly trivial piece of work. It simply keeps all 256 possible codes in a list. Each time a character is to be output, you send its position in the list, then move it to the front. In the above example, column L might start with the following sequence: "tttWtwttt". If we encode that using an MTF scheme, we would output the following integers: { 116, 0, 0, 88, 1, 119, 1, 0, 0 }.
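
Here is a minimal sketch of such an encoder, enough to reproduce the numbers above. The demo's MTF.CPP processes whole files in the pipeline; this toy program just runs the core loop on the example sequence.

#include <iostream>
#include <string>
#include <vector>

// A minimal sketch of a Move to Front encoder: keep all 256 byte
// values in a list, output each symbol's current position in the
// list, then move that symbol to the front.
int main()
{
    std::vector<unsigned char> order( 256 );
    for ( int i = 0 ; i < 256 ; i++ )
        order[ i ] = (unsigned char) i;

    const std::string input = "tttWtwttt";
    for ( unsigned char c : input ) {
        int pos = 0;
        while ( order[ pos ] != c )           // find the symbol's position
            pos++;
        std::cout << pos << " ";              // emit the position...
        order.erase( order.begin() + pos );   // ...and move the symbol
        order.insert( order.begin(), c );     // to the front of the list
    }
    std::cout << "\n";   // prints 116 0 0 88 1 119 1 0 0
    return 0;
}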

If the output of the BWT operation contains a high proportion of repeated characters, we can expect that applying the MTF encoder will give us a file filled with lots of zeros, and heavily weighted towards the lowest integers.

At that point, the file can finally be compressed using an entropy encoder, typically a Huffman or arithmetic encoder. The example programs that accompany this article use an adaptive order-0 arithmetic encoder as the final step of the process.

And the winner is…

Using the BWT as a front end for data compression has a result similar to that of simple statistical modeling programs. An order-n statistical model simply uses the previous n characters to predict the value of the next character in a string. Compression programs based on statistical modeling can give good results, at the expense of high memory requirements.

The BWT in effect uses the entire input string as its context. But instead of using a preceding run of characters to model the output, it uses the string following a character as a model. Because of the sorting method used, characters hundreds of bytes downstream can have an effect on the ordering of the BWT output.

This characteristic of the BWT has another side effect. It means that in general, the bigger the block size, the better the compression. As long as the data set is homogeneous, bigger blocks will generate longer runs of repeats, leading to improved compression. Of course, the sorting operations needed at the front end of the BWT will usually have O( N * log(N) ) performance, meaning bigger blocks might slow things down considerably. (Although Burrows and Wheeler report that a suffix tree sort can be done in linear time and space.)

Programmers today might also be impressed by the fact that the BWT is apparently not covered by any software patents. I asked Burrows and Wheeler if any patents had been initiated by DEC, and both indicated that none had been. Since this algorithm was published in 1994, it may mean the technique is in the public domain. (Consult an attorney!)

Even if the BWT is in the public domain, programmers need to be careful when selecting an entropy encoder for the final stage of compression. Arithmetic coding offers excellent compression, but is covered by quite a few patents. Huffman coding is slightly less efficient, but seems less likely to provoke intellectual property fights.

A Reference Implementation

One of the nice things about BWT compression is that it's easy to isolate each step of the process. I took advantage of that when writing the demo programs to accompany this article. The compression and decompression processes will consist of running several programs in sequence, piping the output of one program to the input of another, and repeating until the process is complete.

Compressing a file using the demo programs is done using the following command line:

RLE input-file | BWT | MTF | RLE | ARI > output-file

A brief description of each of the programs follows:

RLE.CPP: This program implements a simple run-length encoder. If the input file has many long runs of identical characters, the sorting procedure in the BWT can be degraded dramatically. The RLE front end prevents that from happening. (A sketch of one possible run-length scheme appears after this list.)
BWT.CPP: The standard Burrows-Wheeler transform is done here. This program outputs repeated blocks consisting of a block size integer, a copy of L, the primary index, and a special last character index. This is repeated until BWT.EXE runs out of input data.
MTF.CPP: The Move to Front encoder operates as described in the previous section.
RLE.CPP: The fact that the output file is top-heavy with runs of zeros means that applying another RLE pass to the output can improve overall compression. I believe that further processing of the MTF output will provide fertile ground for additional improvements.
ARI.CPP: This is an order-0 adaptive arithmetic encoder, directly derived from the code published by Witten, Neal, and Cleary in their 1987 CACM article.
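
As promised above, here is a sketch of one possible run-length scheme. I make no claim that it matches the exact format RLE.CPP writes; it simply shows a common convention, where a run of three identical bytes is followed by a count byte giving the number of additional repetitions.

#include <cstdio>

// A sketch of one common run-length convention (not necessarily the
// format RLE.CPP uses): after three identical bytes are copied to the
// output, a count byte follows giving 0-255 additional repetitions.
int main()
{
    int prev = -1;
    int run = 0;
    int c;
    while ( ( c = getchar() ) != EOF ) {
        if ( c == prev ) {
            run++;
            if ( run <= 3 )              // copy up to three bytes literally
                putchar( c );
            if ( run == 3 + 255 ) {      // count byte is full, so flush it
                putchar( 255 );
                prev = -1;
                run = 0;
            }
        } else {
            if ( run >= 3 )              // finish the previous run
                putchar( run - 3 );
            prev = c;
            run = 1;
            putchar( c );
        }
    }
    if ( run >= 3 )                      // flush a run pending at end of file
        putchar( run - 3 );
    return 0;
}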

The output of the compressor can be decompressed by applying the same algorithms in reverse. The command line that accomplishes this is:

UNARI input-file | UNRLE | UNMTF | UNBWT | UNRLE > output-file

Each of the programs here has a one-to-one correspondence with its namesake from the compression section.

The source files that make up this program suite are all fairly short, and their inner workings should be clear. The most complicated programs are the arithmetic encoder and decoder, which are well documented elsewhere.

Results

I used the standard Calgary Corpus to benchmark my version of BWT compression. The corpus is a set of 18 files with varying characteristics that can be found via anonymous ftp at ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus. For purposes of comparison I show the raw size of the file, the compressed sizes using the BWT compressor, and the compressed sizes using PKZIP with the default compression level selected.

File Name      Raw Size   PKZIP Size   PKZIP Bits/Byte   BWT Size   BWT Bits/Byte
bib             111,261       35,821              2.58     29,567            2.13
book1           768,771      315,999              3.29    275,831            2.87
book2           610,856      209,061              2.74    186,592            2.44
geo             102,400       68,917              5.38     62,120            4.85
news            377,109      146,010              3.10    134,174            2.85
obj1             21,504       10,311              3.84     10,857            4.04
obj2            246,814       81,846              2.65     81,948            2.66
paper1           53,161       18,624              2.80     17,724            2.67
paper2           82,199       29,795              2.90     26,956            2.62
paper3           46,526       18,106              3.11     16,995            2.92
paper4           13,286        5,509              3.32      5,529            3.33
paper5           11,954        4,962              3.32      5,136            3.44
paper6           38,105       13,331              2.80     13,159            2.76
pic             513,216       54,188              0.84     50,829            0.79
progc            39,611       13,340              2.69     13,312            2.69
progl            71,646       16,227              1.81     16,688            1.86
progp            49,379       11,248              1.82     11,404            1.85
trans            93,695       19,691              1.68     19,301            1.65
total:        3,251,493    1,072,986              2.64    978,122            2.41

Table 1

Mandatory Comparison of Results

As Table 1 shows, this demonstration program manages to compete pretty well with the compression offered by PKWare's commercial product, PKZIP. I consider this particularly encouraging in light of the fact that there should be plenty of headroom left for tweaking additional percentage points from my code.

Acknowledgments

I had some excellent help from a few people who reviewed this article and the source code. My thanks go out to Nico de Vries, Leo Broukhis, Jean-loup Gailly, David Wheeler, and Robert Jung.

I also have to thank Quincey Koziol and Philipp Knirsch for the help they gave me with UNIX testing. Without their assistance, the source code for this article would have only been usable under MS-DOS and Windows.

Bibliography and Resources

You can find resources, additional source code, and commentary regarding this compression technique on my personal web page at http://www.dogma.net/markn, under the "Magazine Articles" section. An excellent source of information on data compression in general can be found at http://www.internz.com/compression-pointers.html.

Burrows, M. and Wheeler, D.J. (1994) "A Block-sorting Lossless Data Compression Algorithm", Digital Systems Research Center Research Report 124, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html.

Witten, I.H., Neal, R., and Cleary, J.G. (1987) "Arithmetic coding for data compression," Communications of the Association for Computing Machinery, 30(6), 520-540, June.

Nelson, Mark and Gailly, Jean-Loup, (1995) The Data Compression Book, second edition, M&T Books, New York.

Source Code

rle.cpp: The run length encoder program.
bwt.cpp: Burrows-Wheeler transform code, implemented using the traditional C function qsort().
bwta.cpp: Burrows-Wheeler transform code, implemented using the STL.
mtf.cpp: A Move To Front encoder.
ari.cpp: An order-0 adaptive arithmetic encoder, using the code from the 1987 CACM article by Witten, et al.
unrle.cpp: Run length decoder, undoes the encoding performed by rle.cpp.
unbwt.cpp: Inverse Burrows-Wheeler transform, using the format produced by bwt.cpp and bwta.cpp.
unmtf.cpp: The Move to Front decoder, companion to mtf.cpp.
unari.cpp: An order-0 adaptive arithmetic decoder, using the code from the 1987 CACM article by Witten, et al.
readme.txt: Build instructions, short and sweet! Please read!
bwtcode.zip: The whole package in a ZIP file.


Copyright (c) 1995-1998, Mark Nelson, All Rights Reserved.