The magazine of the Melbourne PC User Group

Data Compression
Major Keary

Very few computer users do not meet data compression in some form or other. Backup programs and the increasing popularity of packages such as Stacker bring us into contact with techniques essential to many aspects of modern computing. Bulletin board users will be familiar with compressed files, but those who feed documents into the ubiquitous fax machine are likely to be oblivious to the fact that it compresses data before transmission.

There are two kinds of compression: one performed by software (such as PKZIP) and the other by hardware (for example, fax, newer modems, and tape drives). Here I will be talking mainly about software compression.

The term compression has come to mean that a collection of data (a file) can be reduced in size and restored to its original form - but not always. A compressed executable file must, when uncompressed, be exactly the same as the original, whereas graphics and sound files can afford to lose some of their original data. Thus we have, under the head of software compression, two types: lossy and lossless.

A Bit of History

Data compression is at the core of Information Theory (IT), which is not to be confused with Information Technology. The father of IT-a branch of mathematics-is Claude Shannon; another person was working on improved coding techniques at the same time: Robert Fano. Thus the term Shannon-Fano Coding. At the time, the late 1940s, their work had nothing to do with digital computers, but was to become the first effective method of data compression.

The system gave way to Huffman, created by David Huffman whose original work was the subject of many improvements and variations by others. Huffman is, in various versions, the de facto fax compression standard and is the foundation of MNP-5 (Microcom Networking Protocol).

Arithmetic coding has been a significant advance, enabled by the introduction of more powerful processors.

All of those compression systems are based on statistical modelling and continue to be improved.

The most important development in lossless data compression has flowed from the work of two Israeli information theorists, Jacob Ziv and Abraham Lempel. They are the pioneers of dictionary compression and, in 1977 and 1978 respectively, published two papers describing compression techniques known as LZ77 and LZ78.

Developments in other fields created a need for different kinds of data reduction. Digitised audio has been around since the 1960s and much research has gone into encoding and compression to produce lossy systems.

Graphics files can be gargantuan; lossless compression makes an impact, but still leaves many files too big to be moved easily. By the late 1980s a hardware solution was available in the form of add-on cards. Then the International Standards Organisation (ISO) and International Telegraph and Telephone Consultative Committee (CCITT, which derives from the French name of the Committee) stepped in. They formed a Joint Photographic Experts Group (JPEG) which arrived at a standard. Since then MPEG has been formed to work on moving picture compression.

For the benefit of purists, CCITT now goes under another name, ITU-T, but its initials will remain tied to various communications standards for a long time. The ITU-T stands for International Telecommunications Union.

Crude, but Effective

There is a simple compression system, still used for some graphics files, known as RLE, Run Length Encoding. It works on the fact that image files have lots of zeroes. All it does is count up consecutive zeroes and replace them with a code that, in effect, says "put so many zeroes here". The JPEG algorithm uses RLE as part of its process.

The Statistical Approach to Compression

The first systems used statistical methods in a variety of ways. The principle is that some characters appear more frequently than others. If a table is set up to grade the characters in order of frequency of use, then we can allocate minimum-bit codes to the characters with a high probability rate (which means the most frequently used ones). The lower the probability rate of a character, the more bits can be used to represent it.

As you probably know, characters and symbols are normally represented by 8-bit codes. By using a variable-length code we can represent at least two characters by a single bit, thus saving space in the order of 87 per cent. However, some characters will have, perhaps, 10-bit codes; if they are, for example, x and z, that will hardly matter.

If that sounds confusing, think about Morse Code; the letter e is represented by a single dot, and t by a single dash. That is not accidental. They are the most frequently-used letters in the alphabet, or letters with a high probability rate.

The next step is to cater for the fact that some characters have different probability rates according to their relative position. Thus, u does not have a particularly high probability rate, but there is an exceptionally good chance that it will be the next letter after q.

In statistically-based systems each collection of data is analysed and a unique probability table created. Now, that does not mean the system is guessing; the table allocates variable-bit codes to characters according to the statistical analysis. The probability table has to be attached to the beginning of the compressed file, otherwise the receiver will not be able to decode it. The next step is to encode the data, which also requires that another collection of information has to be attached to the file.

The Arithmetic Approach

The Huffman system had one serious problem. Statistical results usually don't work out in whole numbers, and that makes for less efficient compression. I won't try to explain that here, but arithmetic coding takes advantage of the floating point ability of modern processors and represents a quantum step in compression efficiency.

The Dictionary Approach

The concept of dictionary compression is something we encounter daily. The ubiquitous acronym is a token for something much longer - for example, ASCII - and IBM has long since abandoned its original name for the initial letters.

Most popular compression utilities are based on the LZ77 algorithm, which uses a moving window technique. Imagine a page of text covered by a mask, and the mask has a slot that can be moved along each line in turn to expose a few words at a time. In other words, a window of fixed length allows the reader (in this case, the compression program) to view some of the data. The program looks for repeating groups (usually called phrases), substitutes tokens for them, writes the result into a dictionary, and encodes the data from the window. Unlike Huffman, which makes two passes, the window-dictionary method does everything in a single pass.

That is very much an over-simplification. The LZ77 algorithm has an inherent weakness: the system uses a small window, the effect of which is to miss or even discard phrases suitable for the dictionary. Improvements by others have produced PKZIP, LHA, and ARJ - all based on LZ77 variants. Some, LHA in particular, use Huffman in a second pass to further compress the output stream.

The variant of LZ77 used in popular compression software is known as LZSS; I am not sure, but suspect it came from the work of Stac, founder of Stac Electronics which publishes Stacker and licenses the LZS algorithm to credit card providers.

When you use your American Express card in a Kathmandu restaurant the debit is likely to be on your account before you walk out the door; it's all done with LZS.

Terry Welch and LZ78

Someone tried to tell me that Lempel and Ziv were Unisys employees when they published their work. They have been even further maligned: the first Hijack manual says Lempel-Ziv is one person.

Even though Unisys has some data compression patent rights, the company acquired them through the work of Terry Welch - the W in LZW - who was an employee of Sperry Research Center before Sperry and Burroughs merged, in 1986, to form Unisys.

To return to the subject, LZ78 departs completely from the sliding window concept. It treats the file as a whole and is able to build up a dictionary for very long phrases. There are problems with that system, but I won't attempt to go into them here. The UNIX program Compress uses the Welch variant of LZ78.

Indeed, LZ78 was more popular than LZ77, but it seems that what became known as the ARC Wars - patent disputes and other acrimony - resulted in attention being diverted to LZ77 and ways of improving its capacity. Some of you will remember PKARC; it was based on LZ78, but fell in the skirmish and was' replaced by PKZIP.

Variants of LZW are used widely-in the GIF format, for example-and it is the foundation of V 42bis.

QIC-122

Tape drives are becoming more popular for backing up data and represent an example of hardware compression. The standard system is known as QIC-122; QIC stands for Quarter Inch Cartridge, an industry association of tape-drive manufacturers. The system relies on a chip designed by Stac Electronics and which uses an LZ77 variant.

We can expect to see the technology transferred to hard drives, thus eliminating the need for compression software. However, there are a number of problems to be overcome first.

Sound Compression

I would like to be able to tell you that speech compression chips are being developed for implant into the voice boxes of politicians. Alas, no one is working on such a project.

Digital techniques for sound date back to the 1960s and have reached a high degree of development. Compression of speech can be achieved with normal lossless programs, such as LHA which incorporates Huffman with LZSS.

Lossy techniques give better compression rates, and they can be further improved by further compression with a lossless program. There are many technical considerations that are beyond the scope of this article.

Most of the work seems to be directed towards speech compression. One of the driving forces is multimedia where spoken words represent a significant part of many packages.

Lossy Image Compression

Most of us have come up against huge graphics files, particularly as PCs acquire more processing power to handle high resolution and millions of colours. Even though lossless compression can be used to some effect the compression ratios are not sufficiently high to make such files easily transportable.

Lossy compression relies on the fact that pictures can lose some of their detail without any perceivable degradation; the trick is to find a system that will do the job effectively and quickly. In multimedia applications graphics files do not only have to be reducible; the compression program has to display the image very quickly.

Thankfully, there was a recognition of the need for a standard and ISO with CCITT (now ITU-T) set up JPEG, Joint Photographic Experts Group.

The next pressing problem was video image compression. That is being addressed by MPEG, Moving Pictures Experts Group. If you thought digitised still pictures create big files, try video images they have a voracious appetite for space.

The methods used include techniques for lossless compression, but become highly sophisticated; for example, using a diagonal path to examine pixel sequence. Why diagonal? Well, it gives longer runs of data than does simple horizontal row-by-row scanning.

The Literature

A lot of papers have been published, and there are the draft standards and final standards published by ISO. Hardly light reading. There have been a small number of books published (four that I know of) and some are in preparation, at least one of which is being written by an Australian authority on JPEG.

The most up-to-date book presently available is Mark Nelson's The Data Compression Book. The current edition comes with a disk containing sample code in C illustrating the various techniques (including lossy graphics and sound compression). There is sufficient information to write one's own archival program.

The book itself is well-written; while it does not talk down to non-technical readers, the language is clear, free of jargon, and easy to follow. For those who seek an in-depth understanding the author provides sample code (also contained on the disk) written in C; it illustrates various technical points. However, while familiarity with the C language is desirable, it is not essential for general readers who are interested in comprehending the principles.

For C programmers and those learning it, or wanting to develop their skills, the book provides a useful insight into the structure of working models.

Horses for Courses

Compression is a matter of horses-for-courses; each situation has its own special requirements and programs can be written for particular applications. For backups speed of compression is essential. It is not so noticeable if a program is a bit slow on restoring backups because - at least one hopes - they will be needed in only rare emergencies.

On the other hand, if it takes a little longer to compress a file efficiently for a BBS that is of no consequence if the many people who download it can uncompress it the wink of an eye.

Disk compression applications have to be able to pull data into a readable buffer very, very quickly if the program is to attract buyers. It also has to be able to install applications to a compressed disk without undue loss of time. On top of that it has to be as close to bullet-proof as possible. That, of course, is why Stacker holds a pre-eminent place in the disk compression market.

Data representing sound, graphics (static or animated), and moving pictures requires a trade-off between loss of some data (lossy) and manageable file size. With the rapid development of multimedia it is essential for programmers to be able to write compression routines into applications.

Mark Nelson explains all of those considerations in language that will satisfy the ordinary reader as well as those who need technical information, either as programmers or as system managers who commission programmers. His book should be in any library-public, educational, individual, or corporate -that services computer users whether they are the just-interested, programmers, information managers, or information theorists.
Mark Nelson: The Data Compression Books 
ISBN 1 55851 216 0 
527 pages plus disk 
Published by M&T Books 
RRP $79.95

Reprinted from the July 1994 issue of PC Update, the magazine of Melbourne PC User Group, Australia

[About Melbourne PC User Group]