Parallel and Distributed Searching
Speeding up your database searches?
By Andrew Davison
Many developers assume that parallelism and distributed programming always speed up sequential code. However, when I tested this rose-tinted view by timing the code for various search programs, the results were sobering. This article presents those results and describes how a simple search program for a 24-MB database can be parallelized and distributed across several machines. Along the way, I'll discuss the C fork()and wait()library functions, shared memory, and remote procedure calls (RPCs).
The test database holds membership details in ASCII, with one line per member, sorted by last name. Because of the size of the database, it is stored in four 6-MB files: a2f.txt, g2l.txt, m2r.txt, and s2z.txt. A search query can be given any string, and must return the total number of database lines that contain it, printing up to 40 matching lines. Since the string may appear anywhere in a member's details, a simple linear search is used to scan the database.
Using fgrep
One implementation approach is to use the UNIX fgrep (fast grep) command to print out matching lines. For example, fgrep Andrew a2f.txt will print all the lines containing "Andrew" in a2f.txt, while a linear search of the entire database would be coded as fgrep Andrewva2f.txt ; fgrep Andrew g2l.txt ; fgrep Andrew m2r.txt ; fgrep Andrew s2z.txt.
Each database component file takes about seven seconds to search; the total time for the entire database is about 27 seconds.