|
Data compression is an essential aspect of connectivity; modems have inbuilt
compression facilities, it is used in digital telephony, and without compression many multimedia and graphics
files - especially video - would be too big for practical use.
There are two distinct forms of compression, lossless and lossy. The first is necessary for
data that has to be restored to its exact original form (for example, text, spreadsheet, and database); lossy
compression is used for data that can shed some of its detail without materially affecting the restored file
(for example, images, video, and audio).
Can Compression Get Better?
Lossy compression techniques have undergone remarkable improvements. New algorithms enable surprising degrees
of effective guessing, which means that when a compressed file is being restored the software can replace
discarded data with something very close to the original. Early lossy systems simply discarded some data
before compressing a file; the discarded data was lost forever. Now there are systems that do a very good job
of replicating the original with the help of some complex mathematics.
Lossless compression is another matter; serious discussion of possible significant improvements has been left
to the habitues of specialist Usenet groups. Every now and then a new concept is put forward, such as
reported in The Age of 2 August 1994 following a visit to Australia by Phil Katz. The Age
report said: "Now that we have the computing power he (Phil) is excited about the prospect of entering a
higher level of data compression, using the second order of Markov models, which will shrink a 1MB text file
to about 150KB, compared to between 300 to 40OKB in the latest version of PKZIP"
That exciting prospect made me sufficiently curious as to e-mail Phil to ask which Markov he was talking
about, and was such a quantum leap in lossless compression in fact imminent?
A response from PKWare indicated that "Phil was not sure about which Markov",but someone would get back tome.
There the correspondence rested until, some time later, I raised the subject again seeking news of the
spectacular Markov comet that promised to light up our compression skies.
Markov Chains
The term, Markov models, suggested the work of Andreii Andrecvich Markov 1856-1922 (his given names are
sometimes spelled, Andrey Andreyevich) a Russian mathematician known for his contribution to the development
of the theory of stochastic processes, some of which-Markov chains-carry his name. Stochastic process is
usually associated with probability, and probability is important to information theory from which data
compression derives.
Markov chains were applied to early work in speech recognition, and speech and music synthesis, a connection
that led me to think Phil may have been talking about lossy audio compression rather than data compression.
My assumption, however, was wrong; Steve Burg, Vice President, Engineering, of PKWare Inc. responded to my
follow-up enquiry:
"The Markov models that Phil was referring [to are] based on the work of Andreii Markov in speech
recognition. Most popular compression algorithms at the present time use no information from the current
context to improve compression efficiency. Markov modelling allows improvements by providing predicted future
content. As a simplified example, a person can often predict the end of a sentence based on the start of the
sentence and previous context."
It seems there had been some progress, but there was a problem with achieving the goal of 100 per cent
accuracy, as well as writing an application sufficiently robust and stable for commercial release. Lossless
compression has to restore a file to its exact original form: 99.9 percent is not good enough. It is not
surprising that the project has not been further mentioned.
Is It a Comet, Or Is It a Scam?
A number of compression scams have attracted the attention of the comp.compression newsgroup and it is
worth reading the FAQ for accounts of what jean-Loup Gailly calls, "The WEB 16:1 compressor, OWS and other
illusions". OWS (the initial letters of the program's author) does not compress data, but "remembers the
clusters which contained the data ... which ... can be recovered only if those clusters are not used again
for another file".
Another, WEB 16:1, had its claims published in Byte (June 1992 Vol 17 No 6): " ... the compression
algorithm used... is not subject to the laws of information theory . ... ". As the laws of information theory
have not yet been repealed, the program simply did not work and has not been heard of again.
On a more realistic note, the The Age of 9 May 2000 published a comparison between
WiriZIP/PKZIP and RAR. The author of RAR does not disclose the algorithm(s) employed, but clearly uses
several, among which an LZ77 variant and Huffman are sure to be present. The program offers an option,
solid, that adopts the same technique used by hard disk compression utilities such as Drive
Space.
Depending on the operating system and file management arrangements files do not generally stop at the point
where the user closes them. Under DOS (which is the underlying operating system for Win95 and Win98) a group
of files can occupy space much larger than the sum of the data they contain. Lots of small files can chew up
big chunks of hard disk space. There are ways of minimising such loss, but that is another subject.
In order to overcome the problem RAR offers a solid compression option. It concatenates the files selected
for archiving, thus eliminating the accumulated trailing space, and requires just one header for the archive.
Savings on unused space will not make much difference to the size of an archive, regardless of the utility
employed.
RAR seems to make its saving in two ways. First, it eliminates multiple headers; if each file added to an
archive is kept as a discrete, stand-alone, item then each one requires its own header. Second, if files are
of alike kind they are likely to contain repeated patterns of data that can be replaced by tokens. For
example, if a group of MS Word files are each saved with font information, and they all use the same fonts,
then large blocks of data will be repeated. If the files are concatenated that represents a significant
opportunity for a dramatic compression ratio.
There is quite a body of literature on data compression, but much of it is in professional journals. Two
books in particular are well worth looking at if you are interested in how the systems work.
Mark Nelson and Jean-Loup Gailly: The Data Compression Book 2/e (ISBN 1-55851-434-1, RRP $69.95) is
written for programmers. However, while the extensive illustrative code is written in C, one does not have to
be a programmer to follow the excellent descriptions of how various systems work. Published in 1966, the book
stops short of recent developments. Even so, for developers or experimenters who want to write compression
routines into a program this is an essential text. It covers both lossless and lossy techniques. A book that
ordinary readers can tackle with confidence.
| Khalid Sayood: Introduction to Data Compression (ISBN
1-55860-346-8, RRP $130) was written to provide a single text for an introductory course on data compression.
It covers lossless and lossy systems and provides photographic images to illustrate the original and end
product from various lossy systems.
This is a professional title and assumes some familiarity with linear
algebra. Anyone seeking detailed mathematical descriptions and explanations of compression algorithms will
find this a valuable resource. It includes a brief introduction to information theory and describes various
models (physical, probability Markov, and composite source). There is also an interesting discussion of the
human visual systems and auditory perception.
|

|
Chapters deal with Huffman coding, arithmetic coding,
dictionary techniques (LZ77 and LZ78), lossless image compression, scalar quantization, vector quantization,
differential encoding, sub-band coding, transform coding, speech compression, and video compression.
Teachers and students of computer science courses should Introduction to Data Compression well worth
adding to their reading lists. If your linear algebra is a bit rusty, the author provides a good refresher
tutorial. A copy is in the library for those would like to visit the engine room of data
compression.
| Passing of Two Notable Compressionists
This year has seen the sad passing of two
people whose names have become part of the vocabulary of data compression: Phil Katz and David Albert
Huffman. Phil Katz was an implementer and David Huffman a deviser of compression systems. Without the
algorithms invented by people like David Huffman there would be nothing for people like Phil Katz to
implement, which is not be taken as lessening the importance of their contribution.
There is another side to the coin. When Abraham Limpet and Jacob Ziv published their two famous algorithms,
LZ77 and LZ78, they had nothing to say about how their work could be implemented; it is one thing to devise
and describe an algorithm, but quite another matter to put it to work.
It was sad to hear of their passing; Phil was in his prime and David was certainly too young to have his life
cut short.
|
Reprinted from the June 2000 issue of PC
Update, the magazine of Melbourne PC User Group, Australia
|