The magazine of the Melbourne PC User Group

Regular Expressions
Major Keary

If you consult the Encyclopedia of Computer Science, you will find Regular Expression defined thus: "The formal description for a language acceptable by a finite automaton or for the behaviour of a sequential switching circuit is known as a regular expression". Reading that a second time won't make it any more comprehensible to anyone other than the science literati. The term has a number of applications, but the one that interests us here is computer programming in its widest sense, which includes writing an HTML document, even though that is programming at a very simple level.

Ordinary computer dictionaries are unlikely to list regular expression, but texts on UNIX generally have a straightforward description, such as, "A regular expression is a string that describes a pattern of characters", or "Regular expressions are a way to express patterns of characters ...[and] are much more powerful than wild cards - you can think of them as wild cards on steroids". I have also seen them referred to as wild strings. Regular expressions, commonly referred to as regex and sometimes as regexp, are a handy system of notation for specifying complex patterns; they also form a language - with its own engines - that is incorporated in many applications. As mentioned further on, the degree of incorporation varies; probably the most comprehensive support is built into Perl. There is no single standard that every application follows; each regex-enabled application has its own peculiarities, but that does not present a practical problem.

There are two distinct engine types, DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton), that operate in different ways; for most users 'which engine' is of little consequence, but for those who want to know about the difference there is a detailed discussion in Mastering Regular Expressions (mentioned below).

To the best of my knowledge there is only one book in print that has been written specifically about regular expressions: Jeffrey Friedl's Mastering Regular Expressions. Texts on scripting languages, such as Perl, Python, PHP, and JavaScript, discuss regex in varying degrees of detail. There is, for example, a very good chapter on the subject in Danny Goodman's JavaScript Bible, and most UNIX texts explain its application to some degree - from a passing mention to quite extensive coverage - especially those that describe the vi editor and Emacs.

Who wants to use regular expressions? A popular email program, Pegasus Mail, has limited regex capability that is quite sufficient for mail filtering. For example, the 'From' field in spam email often contains strings of numerals, something not usually found in the addresses used by senders of legitimate email. A filter rule such as From*[0-9]* will pick out any email with a numeral in the 'From' field and deal with such messages according to user preference (delete, move to a separate directory, and so on).
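
The idea behind that filter rule can be sketched in Python, used here only for illustration. In the Pegasus rule the asterisks appear to act as wild cards around the character class; in Python's re module the same effect comes from searching for [0-9] anywhere in the field. The function name and the sample addresses are hypothetical.

```python
import re

# A character class that matches any single numeral, 0 through 9.
DIGIT = re.compile(r"[0-9]")

def from_field_has_digits(from_field):
    """Return True if the From field contains at least one numeral."""
    return DIGIT.search(from_field) is not None

# Hypothetical sample addresses, not taken from real mail.
for sender in ["win98765@example.com", "major@example.org"]:
    verdict = "filter" if from_field_has_digits(sender) else "keep"
    print(sender, "->", verdict)
```

A rule this blunt will also catch legitimate senders whose addresses happen to contain a numeral, so moving such mail to a separate folder is safer than deleting it.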

NoteTab, a text editor, has its own 'engine' that enables quite complex problems to be solved. The following example gives an idea of how it works. The task is to convert all email addresses to HTML mailto links, using the find-and-replace dialogue box:

Find: [A-Z_.-0-9]+@[A-Z_.0-9]+
Replace with: <A HREF="mailto:&">&</A>
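
For comparison, here is a minimal sketch of the same find-and-replace in Python. It is not NoteTab syntax: Python writes the whole match as \g<0> where NoteTab uses &, and the pattern below is a simplified assumption rather than a full email-address matcher. The function name is hypothetical.

```python
import re

# Letters, digits, underscore, period and hyphen on both sides of the @ sign.
# The hyphen is placed last in the class so it is taken literally.
EMAIL = re.compile(r"[A-Z0-9_.-]+@[A-Z0-9_.-]+", re.IGNORECASE)

def link_addresses(text):
    """Wrap every email-like string in an HTML mailto link."""
    # \g<0> stands for the entire matched address.
    return EMAIL.sub(r'<A HREF="mailto:\g<0>">\g<0></A>', text)

print(link_addresses("Write to editor@example.org today"))
```

Because the class includes the period, an address at the very end of a sentence would pick up the full stop as well; a production pattern would need to be stricter.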

Changing HTML tags while retaining content is cumbersome with normal find-and-replace commands. For example, you may want to change <h2> tags to <h3> in an HTML document. That can be achieved quite simply using regex in NoteTab, thus:

Find: <h2>{.*}</h2>
Replace with: <h3>\1</h3>

The curly brackets and their contents form a tagged match word. In this case the period (.) represents any single character, and the asterisk (*) that follows it means zero or more occurrences of the preceding expression. In short, the pattern will match any string found between <h2> and </h2>. In the replace field '\1' represents the text captured by the tagged part of the find pattern; the program effectively leaves the string intact and changes the opening and closing tags. There are special symbols for finding a wide range of matches (\s finds a space; \t finds a tab; \b finds any blank space including space, tab, or form-feed; and \h finds any hex character). The examples above are just some of those in NoteTab; other programs vary in their respective usages and have smaller or larger repertoires.
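
The same tag swap can be sketched in Python, where round brackets take the place of NoteTab's curly brackets and \1 again refers to the captured text. One deliberate difference is an assumption on my part: the sketch uses the non-greedy form .*? so that two headings on the same line are handled separately, whereas a plain .* would run from the first <h2> to the last </h2>. The function name is hypothetical.

```python
import re

# (.*?) captures the heading text; the ? makes the match stop at the
# first closing tag rather than the last one on the line.
H2 = re.compile(r"<h2>(.*?)</h2>")

def demote_headings(html):
    """Change <h2> headings to <h3> while keeping their content."""
    return H2.sub(r"<h3>\1</h3>", html)

print(demote_headings("<h2>One</h2><p>text</p><h2>Two</h2>"))
```

As written, the pattern is case-sensitive and does not match headings that carry attributes or span several lines.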

Many big-gun programming languages require the user to learn everything, from A to Z; the attractive thing about regular expressions is that you don't need to learn any more than is required for your needs.

An example of regex is given in Danny Goodman's comprehensive JavaScript Bible. The problem is to insert separators in strings of numerals; large integers are usually stored in databases without 'punctuation', which makes them difficult to read. A procedure can be devised, using regular expressions, that will solve the problem. The code shows how the three things - HTML, JavaScript, and regex - are combined. It requires JavaScript to define a function, commafy, and looks like Listing 1 below.

If you try that example and it doesn't work properly, check that the quote marks are hex 22 (decimal 34) and not printer's quotes (hex 93-94, dec 147-148 in the Windows character set); the books don't always mention it, but non-standard quote marks play havoc in HTML files and in JavaScript.
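
The following is not Goodman's listing. It is a minimal Python sketch of the comma-insertion idea, under the assumption that the input is a string holding a whole number (an optional minus sign followed by digits); the name commafy is borrowed from the description above, and everything else is my own illustration.

```python
import re

# Group 1: an optional minus sign and as many digits as possible.
# Group 2: the last three digits of that run.
SPLIT = re.compile(r"(-?\d+)(\d{3})")

def commafy(number_text):
    """Insert commas between groups of three digits in an integer string."""
    # Each pass places one comma, working from the right-hand end;
    # the loop stops when no run of four or more digits remains.
    while SPLIT.search(number_text):
        number_text = SPLIT.sub(r"\1,\2", number_text, count=1)
    return number_text

print(commafy("1234567"))
```

The sketch makes no allowance for decimal fractions, whose digits after the point would be grouped as well.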

Powerful (and inexpensive) text editors such as UltraEdit and NoteTab are regex-enabled, as are StarOffice and JavaScript. The name of the UNIX command, grep, comes from the ed editor command g/re/p, "globally search for a regular expression and print", and any application with UNIX/Linux antecedents will usually have regex capabilities.
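
What grep does at its simplest can be sketched in a few lines of Python: keep only the lines in which a pattern is found. This is an illustration of the idea rather than a substitute for the real command, and the function name and sample lines are hypothetical.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain a match for the pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

sample = ["42 apples", "no digits here", "7 pears"]
# ^ anchors the match to the start of the line.
for line in grep_lines(r"^[0-9]+", sample):
    print(line)
```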

Simple regex applications involving text files do not require a lot of skill; after that the degree of possible complexity escalates. Complex problems often require complex solutions, but if the problem occurs in a very large body of text, or presents itself regularly, the effort put into a fully automated regex solution can be a small price to pay.

The thing about regex is that once you have mucked around with simple tasks and discovered workable solutions, moving on to more complex levels can become an interesting challenge. Jeffrey Friedl, in his Mastering Regular Expressions, takes the reader through an amazing regex journey, probing every nook and cranny, and turning over every rock. I am sure that even the most experienced user of regular expressions will find something new, or a different and more efficient solution. The book is written in the context of using regex in Perl, which is very helpful for Perl users, but the concepts are the same for all applications. Regex novices who have a sound knowledge of one or more programming languages will find this a valuable introduction and ongoing resource. Those without a programming background may find it somewhat overwhelming; they should first read the chapters on regular expressions in books such as JavaScript Bible, those about StarOffice, and UNIX texts, and then return to Mastering Regular Expressions. It is well worth developing an acquaintance with regex, even if at an elementary level.

Jeffrey Friedl: Mastering Regular Expressions
ISBN 1-56592-257-3
Published by O'Reilly, 
342 pp. RRP $99.95

Reprinted from the December 2002 issue of PC Update, the magazine of Melbourne PC User Group, Australia
