Friday, November 18, 2005

More about searching

More about searching

Searching your web site has never been easier--an introduction to search methods. (From Linux Journal #70, February 2000)

The CGI (``common gateway interface'') standard was originally designed to allow users to run programs via the Web, which would otherwise be available only on the server. Thus, the first CGI programs were simple interfaces to grep and finger, which received their inputs from an HTML form and sent the HTML-formatted output to the user's browser.

CGI programs, and server-side programs in general, have become more sophisticated since then. However, one application is as useful now as it was in the past: the ability to search through a web site for documents containing a particular word or string.

While search sites (now called ``portals'') make it possible to browse through a large collection of pages spread out over a number of servers, the CGI programs handling the search have an easier job. They have to go through files only on the local server, producing a list of URLs matching the user's request.

This month, we will look at how to implement several different types of search programs. While these programs might not compete successfully with ht://Dig and Webglimpse, they do offer some insight into how these sorts of programs work, and the trade-offs programmers must make when writing such software.

Simple Command-Line Search

Perl has long been my favorite language for writing server-side programs. This is in no small part due to its strong text-handling capabilities. Perl comes with a rich regular-expression language that makes it easy to find one piece of text inside another.

For example, the following one-line program prints any line of test.txt containing the word ``foo'':

perl -ne 'print if m/foo/' test.txt
The -n switch tells Perl not to print lines by default, and the -e switch allows us to insert a program between the single quotes ('). We instruct Perl to print any line in which the m// (match) operator finds the search string. We can accomplish the same thing inside of a program, as shown in Listing 1.

Listing 1

Of course, the above program searches for a single pattern (the string ``foo'') inside of a single file (test.txt). We can generalize the program more by using an empty <>, rather than iterating over . An empty <> iterates through each element of @ARGV (the array containing command-line arguments), assigning each one in turn to the scalar $ARGV. If there are no command-line arguments, then <> expects to receive input from the user. Listing 2 is a revised version of the above program, which searches through multiple files for the string ``foo''. Notice how this version of the program prints the file name as well as the matching line. Since $_ already contains a newline character, we need not put one at the end of the print statement. Listing 2 could be rewritten in a single line of Perl with the following:

perl -ne 'print "$ARGV: $_" if m/foo/;' *
Listing 2

Finally, we can make our simple search a bit more sophisticated by allowing the user to name the pattern, as well as the files. Listing 3 takes the first command-line argument, removing it from @ARGV and putting it in $pattern. To tell Perl that $pattern will not change, and that it should compile the search pattern only a single time, we use m// with the /o option.

Listing 3

Thus, to search for the pattern f.[aeiou] in all of the files with a ``txt'' extension, we would use:

 ./simple-search-3.pl "f.[aeiou]" *.txt
Sure enough, every line containing an f, followed by any character, followed by a vowel is printed on the screen, preceded by a file name.

File::Find

The above would be a good skeleton for our web-based search if all documents on a web site were stored in a single directory. However, the opposite is normally the case: most web sites put files in a number of different directories. A good search program must traverse the entire web hierarchy, searching through each file in each directory.

While we could certainly accomplish this ourselves, someone has already done it for us. File::Find, a package which comes with Perl, makes it possible to create a find-like program using Perl. File::Find exports the find subroutine, which takes a list of arguments. The first argument is a subroutine reference invoked once for each file encountered. The remaining arguments should be directory and file names, which File::Find will read in sequence until it gets to the end.

For example, Listing 4 is a short program that uses File::Find to print all of the file names in a particular directory. As you can see, File::Find exports the variable $File::Find::name which contains the current file name as well as the find subroutine. The current directory is stored in $File::Find::dir.

Listing 4

Listing 5

Listing 5 contains a version of simple-find-2.pl, which uses File::Find to search through all of the files under a given directory tree. As with many programs that use File::Find, the bulk of simple-find-2.pl is spent inside of find_matches, a subroutine called once for every file encountered under the directories passed in @ARGV. To find all files containing the pattern ``f.[aeiou]'' in directories under /home and /development, type:

 ./simple-find-2.pl "f.[aeiou]" /home /development
Line 11 of simple-find-2.pl is particularly important, in that it undefines $/, the variable that determines the end-of-line character. Normally, Perl's <> operator iterates through a file line by line, returning undef when the end is reached. However, we want to search across an entire file, since a pattern might have to extend across lines. By undefining $/, the line

my $contents = ();
puts the entire contents of the file handle FILE inside of $contents, rather than just one line.

Searching via the Web

Now that we can search for a pattern through all files under a particular directory, let's connect this functionality to the Web, searching through all of the files under the HTTP server's document hierarchy. Such a program will need to receive only a pattern from the user, since the web hierarchy does not change very often.

Listing 6

Listing 6 is an HTML form that could be used to provide such input. This HTML form will submit its contents to simple-cgi-find.pl, the CGI program in Listing 7. Its parameter, pattern, contains a Perl pattern to be compared with the contents of each file in the web hierarchy, simple-cgi-find.pl will return a list of documents matching the user's pattern.

Listing 7

Unfortunately, the version of File::Find that comes with Perl does not work with the -T flag, which turns on Perl's secure tainting mode. CGI programs should always be run with -T, which ensures data from outside sources is not used in potentially compromising ways. In this case, however, we cannot run our program with -T. File::Find relies on the fastcwd routine in the Cwd module, which cannot be run successfully with -T. For the time being, I suggest using these programs without -T, but when the next version of Perl is released, I strongly recommend upgrading in order to run CGI programs with full tainting enabled.

Our search subroutine, find_matches, has been modified slightly, so that its results will be more relevant for web users. The first thing it does is to make sure the file has an extension indicating it contains HTML-formatted text or plain text. This ensures that the search will not try to view graphics files, which can contain any characters:

return unless (m/.html?$/i or m/.te?xt$/i);
Some web sites mark HTML files with extensions of .htm (or .HTM), and their text files with .txt or .TXT rather than .text. The above pattern allows for all of these variations, ignoring case with the /i switch and ensuring the suffix comes at the end of the pattern with the $ metacharacter.

After retrieving the contents of the current file, find_matches checks to see if $pattern can be found inside of $contents, which contains the document's contents. We surround $pattern with characters, to look for $pattern on word boundaries. This ensures that searching for ``foo'' will not match the word ``food'', even though the former is a subset of the latter.

If a match is found, find_matches creates a URL by substituting $search_root with $url_root, which hides the HTML document hierarchy from outside users. It then prints the file name inside a hyperlink to that URL:

if ($contents =~ m|$pattern|ios)
{
my $url = "$File::Find::dir/$filename";
$url =~ s/$search_root/$url_origin/;
print qq{
  • $filename
    }
    }
  • Improving on our Web Search

    While simple-cgi-find.pl works, it does have a few problems. For starters, it fails to differentiate between HTML tags and actual content. Searching for ``IMG'' should not match any document containing an tag, but rather any content outside of HTML tags that contains that string. For this reason, we will modify our program to remove HTML tags from the input file.

    Beginning Perl programmers often think that the best way to remove HTML tags is to remove anything between <>, as in:

    $contents =~ s|<.+>||g;
    Since ``.'' tells Perl to match any character and ``+'' tells Perl to match one or more of the preceding character, the statement above looks like it tells Perl to remove all of the HTML tags. Unfortunately, this is not the case--the statement will remove everything between the first <> appearing in the file. This is because Perl's patterns are ``greedy'', and try to maximize the number of characters they match.

    We can make ``+'' non-greedy and try to match only the minimum number of characters by placing a ? after it. For example:

    $contents =~ s|<.+?>||g;
    There is also the sticky issue of what to do if $pattern contains white space. Should it be considered as a search phrase containing one or more white-space characters? Or should it be considered several different words with an ``or'' or ``and'' search?

    Listing 8

    In this particular case, we can have our cake and eat it, too. By adding a set of radio buttons to the HTML form, we can allow the user to choose whether a search should be literal, require all search terms be found or require any one of the search terms be found.

    Now we can modify our program to handle ``phrase'' searches (as we have been doing until now), ``and'' searches (in which all of the words must appear) and ``or'' searches (in which one or more of the words must appear).

    To implement an ``and'' search, we break the elements of phrase apart by using Perl's ``split'' operator. We then count the number of words we must find, iterating over each of them and checking to see if they all exist in $contents. If $counter reaches 0, we can be sure all words appear:

    elsif ($search_type eq "and")
    {
    my @words = split /s+/, $pattern;
    my $count = scalar @words;
    foreach my $word (@words)
    {
    $count- if ($contents =~ m|$word|is);
    }
    unless ($count)
    {
    print qq{
  • $filename
    };
    $total_matches++;
    }
    }
  • An ``or'' search is even easier to implement: once again, we break apart $phrase across white space. If even one of the constituent words matches, we can immediately print the file name and hyperlink, and return from find_matches:

    elsif ($search_type eq "or")
    {
    my @words = split /s+/, $pattern;
    foreach my $word (@words)
    {
    if ($contents =~ m|$word|is)
    {
    print qq{
  • $filename
    };
    $total_matches++;
    return;
    }
    }
    }
  • Finally, we should have some way of telling the user how many documents matched. We do this by creating a new variable, $total_matches, which is incremented each time a document matches (as seen in the above code fragments for ``and'' and ``or'' searches).

    These improvements are incorporated into the search program called better-cgi-search.pl, in Listing 9, not printed here but contained in the archive file (see Resources).

    Excluding Directories and Files

    We now have a fairly full-functioned search program which can handle most types of searches people want to do. The problem is that we have created a program which might be too good to be useful. Many clients of mine put information on their web sites before it is meant to be released. Without any links leading to these directories and documents, it is unlikely someone will be able to find them. However, our search program does not depend on hyperlinks in order to find documents.

    One common solution is for a search program to ignore any directory containing a file named .nosearch. This file does not need to contain any data, since its mere existence means a directory's contents will be skipped.

    The easiest implementation would check for the existence of a .nosearch file in the directory currently being probed. However, checking for a file with each invocation of find_matches would reduce our program's already slow performance even more. It would be better if the program looked for a .nosearch file, then stored that information in a hash to be retrieved when future files in that directory are examined.

    The Other Problem

    We can solve these problems with two lines of code. The first, placed at the beginning of find_matches, returns immediately if a .nosearch file has already been found in the current directory:

    return if ($ignore_directory{$File::Find::dir});
    If we reach the second line, it means that no .nosearch file has been found for this directory. However, there are several circumstances under which a .nosearch file wasn't found, yet should still be in force: when we are examining the .nosearch file itself, when a .nosearch file is in the directory or when a .nosearch file is in the parent directory. After all, if the parent directory should not be searched, then neither should the child directory. Here is the code fragment that accomplishes this:

    # Mark the directory as ignorable ...
    $ignore_directory{$File::Find::dir} = 1
    if (($_ eq ".nosearch") ||
    (-e ".nosearch") ||
    (-e "../.nosearch"));
    Listing 10 contains a version of better-cgi-search.pl with these additions and can be found in the archive file (see Resources).

    Is This Any Way to Run a Search?

    If you have already run these programs, you most likely found the main problem with the system outlined above: it is very slow. If your web site contains 100 files, this system works just fine. However, if your site expands to 1000 or 10,000 files, users will stop the search in the middle because it will take too long.

    For this reason, most serious search engines employ a different strategy, one which separates the searching into two different stages. In the first stage, an indexing program takes the files apart, keeping track of where they might be. A second program is then run as a search client, looking through the pregenerated index for matches.

    Next month, we will examine some ways of creating such indices, as well as how to look through them. Perhaps our simple search programs will not be able to complete with Glimpse and ht://Dig, but at least we will understand roughly how they work and what trade-offs are involved when writing search programs.