Search This Site
By Randal L. Schwartz
Back in my April 1997 column ("A Web-Search CGI"), I provided a simple script that searched the text of the programs I've written for this column over the years. Recently, I've been hacking my overall Web site design, and thought it would be cool to be able to search my entire site. The program of that column could do the trick, but only if I never planned on getting anything else done with my Web server box again, because it would be expensive to search everything.
But I thought to myself, hey, the big search engines have already come to my site, fetched all the pages I want to have searched, and indexed them for me. Furthermore, they have more spare CPU cycles than I have, and it'd be nice to take advantage of that.
And then I remembered that many of the search engines provide a way to restrict the returned hits to a specific URL or site. I could use this to my advantage to create a wrapper that uses the big search engine to return hits only on my site!
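As a rough sketch of that idea, here is how a site-restricted query URL might be built. The engine address, the "q" parameter, and the "site:" prefix are all made up for illustration; each real engine has its own form fields and restriction syntax, so treat this as a shape rather than a recipe.

```perl
use strict;
use warnings;

# Hypothetical values: the engine URL, the "q" parameter, and the
# "site:" prefix are assumptions, not any real engine's interface.
my $ENGINE = 'http://search.example.com/query';
my $SITE   = 'www.example.org';

# Percent-encode everything except URI "unreserved" characters.
# Assumes plain ASCII input, which is enough for this sketch.
sub escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Build a query URL that asks for $terms, restricted to $SITE.
sub site_query_url {
    my ($terms) = @_;
    my $query = "$terms site:$SITE";
    return $ENGINE . '?q=' . escape($query);
}

print site_query_url('perl cgi'), "\n";
```

The hand-rolled escape routine keeps the sketch free of non-core modules; with the URI module installed, its query_form method would do the same job with less fuss.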
The upside of this approach is that I leverage existing work, and someone else's disk and CPU. The downside is that the spiders don't visit very often, so new material is likely to be missing from such an index. But for mostly static or old pages, the trade-off is often worth it.
Of course, Perl can pass the proper values in to the search engine's form-response CGI programs, but the answer comes back as HTML. It's a mess to figure out which part of that HTML is a link to some hit, and which part is simply a link to an ad or something.
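To give a feel for the mess, here is a crude sketch that scans a result page for links and keeps only the ones pointing at my own site. The site name and the sample page are invented for illustration, and the regex copes only with double-quoted href attributes; a real HTML parser such as HTML::LinkExtor would be more robust.

```perl
use strict;
use warnings;

# Hypothetical site name, for illustration only.
my $SITE = 'www.example.org';

# Return the href values in $html whose host is $SITE.
# Crude regex scan: only double-quoted href attributes are seen.
sub hits_on_site {
    my ($html) = @_;
    my @hits;
    while ($html =~ /<a\s[^>]*?href\s*=\s*"([^"]+)"/gi) {
        my $url = $1;
        # Keep the link only if its host part is exactly $SITE.
        push @hits, $url if $url =~ m{^https?://\Q$SITE\E(?:[:/]|$)}i;
    }
    return @hits;
}

# Invented sample of what a result page might contain.
my $page = <<'HTML';
<a href="http://ads.example.net/click?id=1">Buy stuff</a>
<a href="http://www.example.org/columns/col01.html">Column 1</a>
<A HREF="http://www.example.org/">Home</A>
HTML

print "$_\n" for hits_on_site($page);
```

Filtering by host throws away the ad link, but it still cannot tell a genuine hit from, say, a navigation link that happens to point at the same site, which is exactly why picking the results out by hand is unpleasant.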