The magazine of the Melbourne PC User Group
Regular Expressions
Major Keary
The Encyclopedia of Computer Science defines regular expression thus: "The formal description for a
language acceptable by a finite automaton or for the behaviour of a sequential switching circuit is known
as a regular expression". Reading that a second time won't make it any more comprehensible to anyone
other than the science literati. The term has a number of applications, but the one that interests us
here is computer programming in its widest sense, which includes writing an HTML document, even though
that is programming at a very simple level.
Ordinary computer dictionaries are unlikely to list regular expression, but texts on UNIX generally
have a straightforward description, such as, "A regular expression is a string that describes a pattern
of characters", or "Regular expressions are a way to express patterns of characters
...[and] are much more powerful than wild cards - you can think of them as wild cards on steroids". I
have seen them referred to as wild strings. Regular expressions, commonly referred to as regex and
sometimes as regexp, are a handy system of notation for specifying complex patterns; regex is also a
language - with its own engines - that is incorporated in many applications. As mentioned further on, the
degree of incorporation varies; probably the most comprehensive support is built into Perl. There is no
single standard that every application follows; each regex-enabled application has its own peculiarities,
but in practice that rarely presents a problem. There are two distinct engine types, DFA and NFA, that
operate in different ways; for most users 'which engine' is of little consequence, but for those who want
to know about the difference there is a detailed discussion in Mastering Regular Expressions (mentioned
below).
To the best of my knowledge there is only one book in print that has been written specifically about
regular expressions: Jeffrey Friedl's Mastering Regular Expressions. Texts on scripting languages, such
as Perl, Python, PHP, and JavaScript, discuss regex in varying degrees of detail. There is, for example,
a very good chapter on the subject in Danny Goodman's JavaScript Bible, and most UNIX texts explain its
application to some degree, from a passing mention to quite extensive coverage, especially those that
describe the vi editor and Emacs.
Who wants to use regular expressions? A popular email program, Pegasus Mail, has limited regex
capability that is quite sufficient for mail filtering. For example, the 'From' field in spam email often
contains strings of numerals, something not usually found in the addresses used by senders of legitimate
email. A filter rule, From*[0-9]*, will pick out any email with numerals in the 'From' field and deal
with such messages according to user preference (delete, move to a separate directory, and so on).
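The same idea can be sketched outside Pegasus Mail. The following is a minimal illustration in Python, one of the scripting languages mentioned above; it assumes Python's re module rather than Pegasus Mail's own filter syntax, and the helper name looks_like_spam is hypothetical.

```python
import re

# Assumption: headers arrive as plain text, one header per line.
# MULTILINE lets ^ match at the start of every line, and because "." does
# not match a newline the search stays inside the From line.
FROM_HAS_DIGIT = re.compile(r"^From:.*[0-9]", re.IGNORECASE | re.MULTILINE)


def looks_like_spam(headers: str) -> bool:
    """Return True if the From header contains at least one numeral."""
    return FROM_HAS_DIGIT.search(headers) is not None
```

A rule this blunt will also catch legitimate senders whose addresses happen to contain numerals, so moving such messages to a separate directory is safer than deleting them.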
NoteTab, a text editor, has its own 'engine' that enables quite complex problems to be solved. The
following is an example to give an idea of how it works. The task is to convert all email addresses to
HTML Mailto links, using the find-and-replace dialogue box:
Find: [A-Z_.-0-9]+@[A-Z_.0-9]+
Replace with: <A HREF="mailto:&">&</A>
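For readers without NoteTab, here is a rough Python equivalent of that find-and-replace. It is a sketch only: the function name link_addresses is made up for the illustration, and the hyphen has been moved to the end of the first character class because Python would otherwise read .-0 as a range of characters.

```python
import re

# Case-insensitive, so A-Z also covers lower-case letters.
# The hyphen sits last in the class so it is taken literally.
EMAIL = re.compile(r"[A-Z0-9_.-]+@[A-Z0-9_.]+", re.IGNORECASE)


def link_addresses(text: str) -> str:
    """Wrap every address-like string in an HTML mailto link."""
    # \g<0> is Python's way of saying "the whole match", which is the job
    # done by & in the NoteTab replace field.
    return EMAIL.sub(r'<A HREF="mailto:\g<0>">\g<0></A>', text)
```

The pattern is deliberately loose, as in the original: a full stop that ends a sentence immediately after an address would be swallowed into the link, so the result is worth checking by eye.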
Changing HTML tags while retaining content is cumbersome with normal find-and-replace commands. For
example, you may want to change <h2> tags to <h3> in an HTML document. That can be achieved quite
simply using regex in NoteTab, thus:
Find: <h2>{.*}</h2>
Replace with: <h3>\1</h3>
The curly brackets and their contents form a tagged match word. In this case the period (.) represents
any single character, and the following asterisk (*) matches zero or more occurrences of the preceding
expression. In short, it will match any string found between <h2> and </h2>. In the replace field '\1'
represents the text captured by the tagged match in the find field; the program effectively leaves the
string intact and changes the opening and closing tags. There are special symbols for finding a wide
range of matches (\s finds a space; \t finds a tab; \b finds any blank space including space, tab, or
form-feed; and \h finds any hex character). The examples above are just some of those in NoteTab; other
programs vary in their respective usages and have smaller or larger repertoires.
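The heading change can be sketched in Python as well. This is an illustration under assumptions, not NoteTab's own behaviour: Python marks the tagged part with round brackets instead of curly ones, and the sketch uses .*? (match as little as possible) so that two headings on one line are treated separately. The function name h2_to_h3 is hypothetical.

```python
import re

# Round brackets capture the text between the tags; .*? stops at the
# first closing </h2> rather than the last one on the line.
H2 = re.compile(r"<h2>(.*?)</h2>")


def h2_to_h3(html: str) -> str:
    """Change <h2> headings to <h3>, leaving the heading text intact."""
    # \1 in the replacement stands for whatever the brackets captured.
    return H2.sub(r"<h3>\1</h3>", html)
```

As written, the period does not match a line break, so a heading that is split across two lines is left untouched.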
Many big-gun programming languages require the user to learn everything, from A to Z; the attractive thing
about regular expressions is that you don't need to learn any more than is required for your needs.
An example of regex is given in Danny Goodman's comprehensive JavaScript Bible. The problem
is to insert separators in strings of numerals; large integers are usually stored in databases without
'punctuation', which makes them difficult to read. A procedure can be devised, using regular expressions,
that will solve the problem. The code shows how the three things - HTML, JavaScript, and regex - are
combined. It requires JavaScript to define a function, commafy, and looks like Listing 1 below.
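To give an idea of how a regular expression can do this job, here is one possible approach in Python. It is not Goodman's JavaScript listing, only a sketch that borrows the function name commafy; it assumes the input is a string of numerals with nothing after the last digit.

```python
import re

# Match a digit that is followed by one or more complete groups of three
# digits running to the end of the string. The (?=...) part only looks
# ahead; it does not use up the digits it inspects.
THOUSANDS = re.compile(r"(\d)(?=(\d{3})+$)")


def commafy(digits: str) -> str:
    """Insert a comma after every digit that starts a group of thousands."""
    return THOUSANDS.sub(r"\1,", digits)
```

Called as commafy("1234567"), the function returns "1,234,567"; a three-digit string such as "123" comes back unchanged.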
If you try the JavaScript example and it doesn't work properly, check that the quote marks are hex 22
(decimal 34) and not printer's quotes (hex 93-94, decimal 147-148 in the Windows character set); the
books don't always mention it, but non-standard quote marks play havoc in HTML files and in JavaScript.
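Fixing stray printer's quotes is itself a small regex job. The sketch below assumes the file has already been read into a Python string, where the two printer's double quotes appear as the Unicode characters U+201C and U+201D; the function name straighten_quotes is made up for the example.

```python
import re

# A character class holding the opening and closing printer's double quotes.
PRINTERS_QUOTES = re.compile("[\u201c\u201d]")


def straighten_quotes(source: str) -> str:
    """Replace printer's double quotes with the plain hex 22 quote mark."""
    return PRINTERS_QUOTES.sub('"', source)
```

Single printer's quotes are not covered here; they could be handled the same way with a second character class.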
Powerful (and inexpensive) text editors such as UltraEdit and NoteTab are regex-enabled, as are
StarOffice and JavaScript. The name of the UNIX command grep comes from the ed editor command g/re/p,
"globally search for a regular expression and print", and any application with UNIX/Linux antecedents
will usually have regex capabilities.
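What grep does with a regular expression can be sketched in a few lines of Python. This is a toy illustration only, with none of the real command's options, and the pattern is interpreted by Python's re module, whose syntax differs in places from grep's.

```python
import re
from typing import Iterable, List


def grep(pattern: str, lines: Iterable[str]) -> List[str]:
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    # search() looks for the pattern anywhere in the line, not just at the start.
    return [line for line in lines if regex.search(line)]
```

For instance, grep("[0-9]+", lines) keeps only the lines that contain at least one numeral.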
Simple regex applications involving text files do not require a lot of skill; after that the degree of
possible complexity escalates. Complex problems often require complex solutions, but if the problem occurs
in a very large body of text, or presents itself regularly, the effort put into a fully automated regex
solution can be a small price to pay.
The thing about regex is that once you have mucked around with simple tasks and discovered workable
solutions, moving on to more complex levels can become an interesting challenge. Jeffrey Friedl, in his
Mastering Regular Expressions, takes the reader on an amazing regex journey, probing every nook and
cranny, and turning over every rock. I am sure that even the most experienced user of regular
expressions will find something new, or a different and more efficient solution. The book is written in
the context of using regex in Perl, which is very helpful for Perl users, but the concepts are the same
for all applications. Regex novices who have a sound knowledge of one or more programming languages will
find this a valuable introduction and ongoing resource. Those without a programming background may find
it somewhat overwhelming; they should first read the chapters on regular expressions in books such as
JavaScript Bible, those about StarOffice, and UNIX texts, and then return to Mastering Regular
Expressions. It is well worth developing an acquaintance with regex, even if at an elementary level.
Jeffrey Friedl: Mastering Regular Expressions
ISBN 1-56592-257-3
Published by O'Reilly,
342 pp. RRP $99.95
Reprinted from the December 2002 issue of PC Update, the magazine of Melbourne PC User Group, Australia