The magazine of the Melbourne PC User Group

Build a "Quick and Dirty" Search Engine
with Python and PHP

Peter Dann
pdann@melbpc.org.au

Occasionally I have found a need to locate items within a large collection of HTML documents that I have been unable to search using a conventional search engine.
 
This article describes qdse, a "quick and dirty search engine" built to address this need. The program provides full text searching of HTML documents that may be located locally on a user's machine (for example, programming documentation), on a corporate intranet (for example, product information), or on the Internet.

The version described here runs locally on the author's Windows machine, where it is used to search the documentation for PHP, a server-side scripting language, stored on a locally installed Web server. The program can easily be adapted to other uses.

While qdse is a useful tool in its own right, it also illustrates in one compact application, key features of some powerful but freely available software and programming tools that could be of interest to any Web developer, including:

  • Python - the programming language qdse uses to create a document index.
  • mySQL - the database qdse uses to store the document index.
  • PHP - the server-side scripting language qdse uses to interrogate the document index and present details of hits on a Web page.
T his article describes the overall design of qdse and explains key sections of the underlying Python and PHP code. While this is not a full-blown tutorial for either language, it should provide a useful insight into how these tools can be used to build a real Web application.

The Key Components

qdse consists of three components:
  1. A mySQL database which stores:
    • the URLs and titles of all the documents to be searched
    • a record for each unique word that appears in these documents
    • records indicating which words appear in which documents
  2. An indexing program written in Python, which creates and populates the tables in the document index database
  3. A Web page user interface where the user enters a list of terms for which to search, and which displays the search result as a list of hyperlinked hits. This interface uses PHP to interrogate the document index database and present the resulting hits to the user.

As mentioned above, mySQL, Python and PHP are all available at no cost. For more details, see "Obtaining the core components" at the end of this article.

The Document Index Database

At the heart of qdse is the document index database that stores details of the documents to be searched.
In qdse, this database runs under mySQL, an open-source database program which is a popular choice for Web development projects. mySQL understands standard SQL - a widely-used language for querying databases. It is well-documented, is available in both Windows and Unix/Linux versions and integrates easily with PHP, the language in which qdse's user interface is written.

mySQL does not implement some features associated with fully-featured databases, including support for transactions and foreign key relationships, and is optimised instead for speed. These limitations are not a problem in many Web applications, including qdse.

mySQL runs as a server daemon under both Windows NT and Unix/Linux. Users can communicate with the database server using client programs such as MySqlManager that come with the application, as well as through directly through code (for example, using Python and PHP). 

Tables Structure

qdse's document index database consists of three tables.

Table "urls" lists the URL and title of each document the application has indexed, together with a unique ID for each record. Once a site has been indexed and is ready to be searched, this table might contain entries such as those depicted in figure 1 as follows:

id url title
1 http://localhost/phpmanual/appendicies.html Appendicies
2 http://localhost/phpmanual/calling-user-functions.html Calling User Functions
3 http://localhost/phpmanual/class.dir.html Untitled
4 http://localhost/phpmanual/config-apache.html Apache Module

Figure 1. urls table

Table "words" contains one entry for each unique word found in the indexed documents. No word appears more than once, and each word has a unique ID. Once a site has been indexed, this table might contain entries such as those in figure 2 as follows:

id word
1 php
2 manual
3 appendicies
4 table

Figure 2. words table

Table "words_x_url" shows which words appear at which URL. This table refers to words and URLs by their unique IDs. Once populated, it might contain entries like the following:

word_id url_id
179 6
180 6
1 4
2 4
171 4

Figure 3. words_x_url table

The first entry shown above indicates that in this particular document index database the word whose ID is 179 appears in the document whose ID is 6. 

Querying the Database Manually

When developing a database application we often need to pass queries to the database manually, rather than through a programming language. When running mySQL under Windows, an easy way to do this is use mySqlManager, a database client program that comes with the Windows distribution.

For example, to find the word that has ID 179 we can pass an SQL query to the database. We type the query into the Query tab of MySqlManager (Figure 4) then click the small triangular icon to execute the query. In this particular database, the word whose ID is 179 is "virtual". Result shown in figure 5.


Figure 4. Entering an SQL query in MySqlManager. 
The results are available in the Results tab.


Figure 5. Viewing the results of the query.

To get the IDs of all the documents this which this word appears, we would execute an SQL query such as the following:
SELECT url_id FROM words_x_url 
WHERE word_id = 179


The Indexing Program

So, how are the tables described above created and populated in the first place?
 
While we could enter the information manually using an application like MySqlManager, obviously this would be time consuming and error-prone. What we need is a program that will do the work for us. This program should:

  • create the tables described above
  • identify all the HTML documents to be indexed
  • open each document in turn - extract its title and a list of all the unique words in the document
  • insert a record in the urls table for every document opened showing its URL and title
  • insert a record in the words table for each new word encountered
  • insert a record in the words_x_url table each time the program finds a word (in a document) not previously found in that document.
Perl or Python?

While there are many languages in which it would be possible to write such a program, two obvious candidates for this task are Perl and Python. 

Both of these so-called "glue languages" excel at processing large quantities of text. Both offer excellent facilities for extracting information from one place (for example, from a file or Web resource) and placing it somewhere else (for example, into a database table). Both languages are interpreted, rather than compiled, which tends to speed development. Both languages are distributed under an open source licence and are available for free.

Perl's strengths include its large user base and the extensive range of libraries available for performing specialised tasks (for example, creating and manipulating PDF files). On the other hand, Perl's syntax is notoriously obscure and this tends to make it difficult for anyone who is not programming in Perl fulltime to maintain Perl code. Perl's support for object orientation is rather inelegant.

Python shares many of Perl's strengths, while offering some additional benefits, including clean support for object-orientation. One unusual but surprisingly helpful feature of the language is that white space is significant: the level of code indentation indicates how lines of code are organised into blocks. At first glance this may seem like a trivial "advantage", but it is an advantage one quickly comes to appreciate when actually working with the language. Add a variable naming convention that allows the use of plain text variable names and the result is code that is pleasantly clean and easy to maintain.

Creating the Tables

The first task performed by qdse's indexing program 'qdse.py' is to create the database tables described above. 

Listing 1 shows the code that creates the words table:

Line 1 imports (ie. makes accessible to our program) a Python module called "MySQLdb". This module provides us with objects and functions we need to access a mySQL database. 
Line 2 declares a global variable that holds the name of our document index database.
Lines 4 to 13 define a function called "create_WORDS_table()". Assuming that we have already created an empty database called qdse using MySqlManager, line 5 makes a connection to this database, while 
line 6 creates a cursor object we will use to pass queries to the database.
In lines 7 to 11 we construct the query that will create the words table, before executing the query in line 12.


Listing 1. Python function to create a database table.

It is worth noting that the table we have created has an index for the "word" field (line 10). This dramatically improves the performance of the indexing program when looking up or inserting records in the words table.
qdse uses code similar to that shown in Listing 1 to create the other tables the application requires.

Identifying the Documents To Be Indexed

In any search engine indexing program a key routine is the function which determines which pages are to be indexed, and can therefore be searched by the user. Fully-fledged search engines use a range of different approaches to this task. So-called "spiders", for example, crawl through page after page identifying links, and then following these links to find yet more links, and so on.

The original version of qdse was designed to search 1100 documents on an intranet site, the titles of which were listed, as hyperlinks, on a single page.
 
The version published here takes a similar (and equally simple) approach, identifying the pages to be indexed, and hence made searchable, by parsing the virtual index page returned by the Web server when passed a directory name.
 
On the author's machine the documentation for PHP is installed in a directory named "phpmanual", immediately beneath the document root of a locally running version of the Apache Web server. Entering the URL "http://localhost/phpmanual/" in a Web browser causes Apache to send a virtual index page listing all the files in the directory "phpmanual". It is this virtual index page that "qdse.py" reads and parses to determine which pages to index for searching. Listing 2 shows the function responsible for this task.


Listing 2 Python function to extract list of URLs from HTML page shows the function responsible for this task.

The square brackets at the end of line 2 indicate that the variable "result_list" is initialised to hold an empty list. Lists are one of Python's key intrinsic data types, and the language offers powerful list-manipulating facilities.

Line 3 constructs a path to our virtual index page, which we open in line 4. The code at line 5 reads each line of the page into a list called "lines".

Lines 6 to 11 create a "regular expression object". Regular expressions are patterns used to find, extract or alter content in a line of text that matches the pattern. Regular expression syntax is quite complex, and I won't attempt to explain it in detail here. Suffice it to say that here we are looking for the shortest section of text in any line which is preceded by a double quote mark (line 7) and ends with ".html" (line 10).
 
If found, such text will be stored as a so-called "group" called "url" (line 8).
 
Line 12 sets up a loop allowing us to examine in turn each line of the virtual index page opened at line 4.

Within the loop, line 13 tests whether there is any text present in the line which matches our regular expression. If so, it stores information about the match in the variable "result". Line 15 extracts the actual matched text and stores it in the variable "url". Line 16 converts the URL from a relative pathname to a fully-qualified pathname.

The the function "get-urls_list()" is quite adequate for the purpose for which qdse was originally designed. However, this is one routine it would be necessary to extend if adapting qdse to search a more complex site.

Parsing Each Document

Once the indexing program has created a list of all the pages we wish to index, it needs to open each one in turn and extracts:

  • the document's title

  • a list of all the unique words in the document

Assuming the variable "urls_list" contains the fully-qualified URLs of all the documents we wish to index for our search engine, the code in Listing 3, along with some helper function we will discuss shortly, does the job. 


Listing 3. Master routine for extracting title and word list from HTML documents.

In lines 1 to 3 we open each document in turn and read its entire contents into the variable "text". This variable contains the whole of the document, including all HTML tags.

In line 4 we create an HTML parser object of class myParser(). This class, defined elsewhere in "qdse.py", is derived from Python's HTMLParser class and extended to provide methods allowing us to read the document's title and the text outside all HTML tages. (The implementation of these methods - not discussed here - is surprisingly easy thanks to the elegant design of Python's native HTMLParser class, and requires only a few lines of code.)

After feeding the text of the entire document to our parser object in line 5, we close the parser, then use its "get_text()" method at line 7 to read all the text in the document outside the HTML tags. Similarly, we call the parser's "get_title()" method at line 8 to read the document's title.

At line 11 we pass the text extracted from the document to an object of another user-defined helper class called "text_splitter". This returns the parsed text as a list of unique words, after first performing some filtering to exclude words in which we are not interested (including very long words, and common words such as and and the).

Once we have all this information, we insert the relevant details into our database tables. The code to do this is similar to the code used to create the database tables (although of course the actual SQL is different).

The User Interface

Once the document index database is created and populated, we need some way of searching this information to locate documents that contain particular words.

In qdse, users enter the terms they are searching for in a simple form, then click a Search button. The same page displays a list of hits, as shown in Figure. 6.


Figure 6. The qdse user interface.

Behind the scenes, the work of constructing appropriate database queries and then interpreting and displaying the results is done by PHP. 

A Quick Introduction to PHP

PHP is a free, open-source server-side scripting language and environment. PHP is broadly similar in its capacities to Cold Fusion and Active Server Pages.

To use PHP, an author embeds PHP-specific tags in an HTML page, as in Listing 4


Listing 4. A simple Web page on the server, including PHP tags.

When a PHP-enabled Web server encounters the tags "<?php" and "?>", it passes the code they enclose to PHP for processing. Depending on the installation, PHP itself may be loaded as a Web server module (like a .DLL), or it may exist as a stand-alone executable. PHP processes the code passed to it then typically returns text to the Web server for inclusion in the page to be sent back to the user. PHP's syntax shares many similarities with C.

Listing 5 shows the page from Listing 4 as it would be sent to a user's browser. The user never sees the PHP code itself, only the results this code generates.


Listing 5. The same page, as received by the browser.

Many different servers are capable of running PHP, including Apache, IIS4 and Netscape. 

The Interface Code

qdse's user interface consists of two pages:

  • "search.php", a page of HTML code with embedded PHP tags. This page contains the form where the user enters the terms to be searched for, and is the only page the user sees.

  • "search.php.inc", a page of PHP code which contains several helper functions called from "search.php".

Listing 6 shows the code for the form in "search.php" where the user enters the terms to be searched for.


Listing 6. The form where user enters search terms.

In line 1, the action of the form is set to the URL of the current page. This means that when the user clicks the Search button, the current page will be redisplayed.

The interesting part of the form is line 5, which is enclosed in PHP tags. PHP makes the contents of all form variables available as similarly-named program variables. In PHP, all variable names begin with "$". After the user submits this form (causing the current page to be redisplayed), the variable $terms will hold whatever value the user previously entered in the form field named "terms".
 
If the variable $terms does contain a value, the isset() function will return true, causing the "echo" statement to send a stream of text to the browser populating the form's input field with whatever values the user entered before clicking the Search button. The input field will therefore appear to retain the text the user previously entered.

Listing 7, also part of "search.php", shows the master routine which analyses the user's search terms, then calls a range of helper functions to interrogate the document index database and display a resulting list of hits. (Some error-processing routines have been removed to make the code easier to read.)


Listing 7. Master PHP routine for processing search terms and displaying hits.

Lines 3 and 4 split the user's search terms entry into an array of separate words.

Lines 7 and 8 create a connection to mySQL and select the document index database.

Line 11 calls a helper function to create a temporary table where we will insert the IDs of all the documents which match one or more of the user's search terms. (This function has a similar purpose to the code in Listing 1, which created a database table from Python.)

At line 12 we call another helper function to populate the temporary table. This function executes a series of SQL queries, one for each search term the user has entered. For example, if the user has entered the search terms "mysql password", this helper function issues the following queries:



Next, the helper function "get_hits()" retrieves a result object containing the URL and title of each document in which all of the user's search terms appears. To get this information, "get_hits()" executes a query such as the following (adjusting the last line to match the number of search terms):

Finally, at lines 17 and 18, we pass the result object to the helper functions "show_hit_count()" and "show_hits()" to list the number of hits and print a hyperlinked list of document titles. Listing 8 shows the code for "show_hits()".

 


Listing 8. PHP function for displaying list of hits retrieved from document index database.

PHP offers several different ways of analysing the result of a database query. The method used here is to treat each row in the result object as a row object, and then to use the C style object->property syntax we see in lines 3 and 4 to access the data we are interested in.
 
Lines 5 to 7 ensure that if we have been unable to determine a title for any document we print its URL instead. Line 8 actually writes the details of each hit to the page.

Obtaining the Core Components

mySQL, Python, PHP and Apache are all available in Windows and Linux versions.

To learn more and download your own versions, see the following addresses:
http://www.mysql.com/, http://www.python.org/, http://www.php.net/, http://www.apache.org/

Source Code

The source code for qdse accompanies this article. The code consists of three files:
qdse.py, search.php, and search.php.inc which have been grouped into the one file qdse.zip which is available for download from the link
, and then the individual files can be extracted.


About the Author
Peter Dann, pdann@melbpc.org.au is a Melb PC member, Web developer and technical writer.

Reprinted from the December 2000 issue of PC Update, the magazine of Melbourne PC User Group, Australia

[About Melbourne PC User Group]